Datasets for categorical data analysis

The vcdExtra package contains 47 datasets, taken from the literature on categorical data analysis, and selected to illustrate various methods of analysis and data display. These are in addition to the 33 datasets in the vcd package.

To make it easier to find those which illustrate a particular method, the datasets in vcdExtra have been classified using method tags. This vignette creates an “inverse table”, listing the datasets that apply to each method. It also illustrates a general method for classifying datasets in R packages.

library(dplyr)
library(tidyr)
library(readxl)

Processing tags

Using the result of vcdExtra::datasets(package="vcdExtra"), I created a spreadsheet, vcdExtra-datasets.xlsx, and then added method tags for each dataset.

dsets_tagged <- read_excel(here::here("inst", "extdata", "vcdExtra-datasets.xlsx"), 
                           sheet="vcdExtra-datasets")

dsets_tagged <- dsets_tagged |>
  dplyr::select(-Title, -dim) |>
  dplyr::rename(dataset = Item)

head(dsets_tagged)
## # A tibble: 6 × 3
##   dataset   class      tags                                
##   <chr>     <chr>      <chr>                               
## 1 Abortion  table      loglinear;logit;2x2                 
## 2 Accident  data.frame loglinear; glm; logistic            
## 3 AirCrash  data.frame reorder; ca                         
## 4 Alligator data.frame loglinear;multinomial;zeros         
## 5 Bartlett  table      2x2;loglinear; homogeneity;oddsratio
## 6 Burt      data.frame ca
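For another package, a starting file for hand-tagging could be produced along the following lines. This is only a sketch in base R: it uses utils::data() rather than vcdExtra::datasets(), the built-in "datasets" package stands in for the package of interest, and the CSV file name is a made-up placeholder.

```r
# List the datasets of an installed package; "datasets" is used here only
# because it ships with every R installation (an assumption for illustration)
info <- utils::data(package = "datasets")$results

# Keep the name and title, and add an empty column for hand-entered tags
template <- data.frame(
  Item  = info[, "Item"],
  Title = info[, "Title"],
  tags  = "",
  stringsAsFactors = FALSE
)

# Write a CSV to a temporary file (hypothetical name) to be tagged by hand
out <- file.path(tempdir(), "datasets-to-tag.csv")
write.csv(template, out, row.names = FALSE)

head(template, 3)
```

The tags column would then be filled in by hand, using ";" as the separator so that the processing steps in this vignette apply unchanged.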

To invert the table, we need to split the tags into separate observations (one row per dataset and tag), then collapse the rows that share the same tag.

dset_split <- dsets_tagged |>
  tidyr::separate_longer_delim(tags, delim = ";") |>
  dplyr::mutate(tag = stringr::str_trim(tags)) |>
  dplyr::select(-tags)

# collapse the rows for the same tag
tag_dset <- dset_split |>
  dplyr::arrange(tag) |>
  dplyr::group_by(tag) |>
  dplyr::summarise(datasets = paste(dataset, collapse = "; ")) |>
  dplyr::ungroup()

# get a list of the unique tags
unique(tag_dset$tag)
##  [1] "2x2"         "agree"       "binomial"    "ca"          "glm"        
##  [6] "homogeneity" "lm"          "logistic"    "logit"       "loglinear"  
## [11] "mobility"    "multinomial" "oddsratio"   "one-way"     "ordinal"    
## [16] "poisson"     "reorder"     "square"      "zeros"
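The split-then-collapse idea can be checked on a small example. The following is a minimal sketch in base R (not the tidyr/dplyr pipeline used above), with made-up dataset names and tags:

```r
# Toy data (hypothetical): three datasets with ";"-separated tags,
# including the stray whitespace seen in the real spreadsheet
toy <- data.frame(
  dataset = c("A", "B", "C"),
  tags    = c("loglinear;2x2", "loglinear; glm", "glm"),
  stringsAsFactors = FALSE
)

# Split each tag string on ";" and trim surrounding whitespace
tag_list <- lapply(strsplit(toy$tags, ";", fixed = TRUE), trimws)

# One row per (dataset, tag) pair
long <- data.frame(
  dataset = rep(toy$dataset, lengths(tag_list)),
  tag     = unlist(tag_list),
  stringsAsFactors = FALSE
)

# Collapse the datasets that share a tag; tapply() orders the result by tag
inverted <- tapply(long$dataset, long$tag, paste, collapse = "; ")
toy_inverted <- data.frame(
  tag      = names(inverted),
  datasets = as.vector(inverted),
  stringsAsFactors = FALSE
)
toy_inverted
```

Trimming matters here: without it, "glm" and " glm" would be treated as two different tags.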

Make this into a nice table

Another sheet in the spreadsheet gives a more descriptive topic corresponding to each tag.

tags <- read_excel(here::here("inst", "extdata", "vcdExtra-datasets.xlsx"), 
                   sheet="tags")
head(tags)
## # A tibble: 6 × 2
##   tag         topic                     
##   <chr>       <chr>                     
## 1 2x2         2 by 2 tables             
## 2 agree       observer agreement        
## 3 binomial    binomial distributions    
## 4 ca          correspondence analysis   
## 5 glm         generalized linear models 
## 6 homogeneity homogeneity of association

Now, join this with the tag_dset created above.

tag_dset <- tag_dset |>
  dplyr::left_join(tags, by = "tag") |>
  dplyr::relocate(topic, .after = tag)

tag_dset |>
  dplyr::select(-tag) |>
  head()
## # A tibble: 6 × 2
##   topic                      datasets                                           
##   <chr>                      <chr>                                              
## 1 2 by 2 tables              Abortion; Bartlett; Heart                          
## 2 observer agreement         Mammograms                                         
## 3 binomial distributions     Geissler                                           
## 4 correspondence analysis    AirCrash; Burt; Draft1970table; Gilby; HospVisits;…
## 5 generalized linear models  Accident; Cormorants; DaytonSurvey; Donner; Draft1…
## 6 homogeneity of association Bartlett
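A left join keeps every tag but leaves the topic as NA when a tag has no entry in the lookup sheet, so it can be worth checking for unmatched tags first. A minimal sketch of such a check in base R, using made-up stand-ins for tag_dset and tags (the tag "newtag" is hypothetical):

```r
# Hypothetical stand-ins for the two tables being joined
toy_tag_dset <- data.frame(
  tag      = c("2x2", "ca", "newtag"),
  datasets = c("Abortion; Bartlett", "AirCrash; Burt", "Mental"),
  stringsAsFactors = FALSE
)
toy_tags <- data.frame(
  tag   = c("2x2", "ca"),
  topic = c("2 by 2 tables", "correspondence analysis"),
  stringsAsFactors = FALSE
)

# Tags used in the data but missing from the lookup table
missing_topics <- setdiff(toy_tag_dset$tag, toy_tags$tag)
missing_topics

# Base-R equivalent of a left join: unmatched tags get topic = NA
joined <- merge(toy_tag_dset, toy_tags, by = "tag", all.x = TRUE)
joined
```

An empty missing_topics means every tag has a descriptive topic.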

Add links to help()

We’re almost there. It would be nice if the dataset names could be linked to their documentation. There are different ways this can be done, but what seems to work on the pkgdown site is a link to ../reference/{dataset}.html. Unfortunately, this won’t work in the actual vignette. The following function is designed to work with the pkgdown site.

add_links <- function(dsets, 
                      style = c("reference", "help", "rdrr.io"),
                      sep = "; ") {

  style <- match.arg(style)
  names <- stringr::str_split_1(dsets, sep)

  names <- dplyr::case_when(
    style == "help"      ~ glue::glue("[{names}](help({names}))"),
    style == "reference" ~ glue::glue("[{names}](../reference/{names}.html)"),
    style == "rdrr.io"   ~ glue::glue("[{names}](https://rdrr.io/cran/vcdExtra/man/{names}.html)")
  )  
  glue::glue_collapse(names, sep = sep)
}
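To see what such a function returns for one string of dataset names, here is a dependency-free sketch of the same idea. The name add_links_base is hypothetical and not part of vcdExtra; it replaces stringr and glue with strsplit(), switch() and sprintf(), but uses the same link templates as above.

```r
# Hypothetical base-R variant of add_links(); not part of vcdExtra
add_links_base <- function(dsets,
                           style = c("reference", "help", "rdrr.io"),
                           sep = "; ") {
  style <- match.arg(style)

  # Split ONE string of dataset names into a character vector
  names <- strsplit(dsets, sep, fixed = TRUE)[[1]]

  # Pick the markdown link template for the requested style
  template <- switch(style,
    help      = "[%s](help(%s))",
    reference = "[%s](../reference/%s.html)",
    "rdrr.io" = "[%s](https://rdrr.io/cran/vcdExtra/man/%s.html)"
  )

  # Fill in each name twice (link text and target), then re-join
  paste(sprintf(template, names, names), collapse = sep)
}

add_links_base("Abortion; Bartlett")
```

With the default style, each name becomes a markdown link of the form [name](../reference/name.html), and the links are joined again with "; ".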

Make the table

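Use purrr::map() to apply add_links() to the string of datasets for each tag. (mutate(datasets = add_links(datasets)) by itself doesn’t work, because add_links() splits a single string with stringr::str_split_1() and so is not vectorized over the whole column.)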

tag_dset |>
  dplyr::select(-tag) |>
  dplyr::mutate(datasets = purrr::map(datasets, add_links)) |>
  knitr::kable()

topic	datasets
2 by 2 tables	Abortion; Bartlett; Heart
observer agreement	Mammograms
binomial distributions	Geissler
correspondence analysis	AirCrash; Burt; Draft1970table; Gilby; HospVisits; HouseTasks; Mental
generalized linear models	Accident; Cormorants; DaytonSurvey; Donner; Draft1970table; GSS; ICU; PhdPubs
homogeneity of association	Bartlett
linear models	Draft1970
logistic regression	Accident; Donner; ICU; Titanicp
logit models	Abortion; Cancer
loglinear models	Abortion; Accident; Alligator; Bartlett; Caesar; Cancer; Detergent; Dyke; Heckman; Hoyt; JobSat; Mice; TV; Titanicp; Toxaemia; Vietnam; Vote1980; WorkerSat
mobility tables	Glass; Hauser79; Mobility; Yamaguchi87
multinomial models	Alligator
odds ratios	Bartlett; Fungicide
one-way tables	CyclingDeaths; Depends; ShakeWords
ordinal variables	Draft1970table; Gilby; HairEyePlace; Hauser79; HospVisits; JobSat; Mammograms; Mental; Mice; Mobility; Yamaguchi87
Poisson distributions	Cormorants; PhdPubs
reordering values	AirCrash; Glass; HouseTasks
square tables	Glass; Hauser79; Mobility; Yamaguchi87
zero counts	Alligator; Caesar; PhdPubs; Vote1980

Voila!
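A note on why the row-wise mapping was needed: a helper that accepts only one string has to be applied to each element of the column in turn. The sketch below illustrates this in base R with a made-up helper, wrap_names(), and vapply() in place of purrr::map(); it is not the vignette's own code.

```r
# Hypothetical scalar helper: accepts exactly ONE string of names
wrap_names <- function(x) {
  stopifnot(length(x) == 1)
  parts <- strsplit(x, "; ", fixed = TRUE)[[1]]
  paste0("[", parts, "]", collapse = "; ")
}

datasets <- c("Abortion; Bartlett", "Mammograms")

# Passing the whole column at once fails the length check
direct <- try(wrap_names(datasets), silent = TRUE)
inherits(direct, "try-error")

# Applying the helper element by element returns one string per row
linked <- vapply(datasets, wrap_names, character(1), USE.NAMES = FALSE)
linked
```

Unlike purrr::map(), which returns a list, vapply() with character(1) returns a plain character vector and checks that each result is a single string.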
