---
title: "1a. Steps Toward Tidy Categorical Data Analysis"
subtitle: "May the Forms Be with You: Novel Functions to Intuitively Convert Among Forms and Collapse Variable Levels Presented Using the `starwars` Data."
author: "Gavin M. Klorfine"
output: rmarkdown::html_vignette
package: vcdExtra
vignette: >
  %\VignetteIndexEntry{1a. Steps Toward Tidy Categorical Data Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.height = 6,
  fig.width = 7,
  dev = "png",
  comment = "##"
)
library(vcdExtra)
library(dplyr)
library(tidyr)
```
# Overview
While R provides many intuitive facilities for manipulating continuous
variables (such as those in the
[`tidyverse`](https://CRAN.R-project.org/package=tidyverse) collection of
packages), it lacks comparable facilities for categorical data. Two areas where
this gap shows are the collapsing of variable levels (e.g., combining hair
colours of "Brown" and "Black" into a "Dark" category) and the conversion
between forms of categorical data (e.g., from a `table` of entries to a
`data.frame` containing frequencies for each combination of variable levels).
## Tidy Collapsing
In R, collapsing levels of a variable in a dataset (e.g., combining
hair colours of "Brown" and "Black" into a "Dark" category) has often required
four steps: convert the data to a convenient form, recode the levels, aggregate
the resulting duplicate rows, and finally convert back to the initial form.
`collapse_levels()` simplifies this process, allowing levels to be collapsed
intuitively for datasets of any form. One just needs to ensure
that an argument of `freq = "the frequency column name"` is supplied when the
input dataset is in frequency form.
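
To make the manual route concrete, here is a minimal base R sketch of the recode-then-aggregate steps on frequency form data. The data frame `hair_freq` and its counts are made up for illustration, and this is not a description of how `collapse_levels()` works internally.

```r
# Toy frequency form data (hypothetical values, not taken from `starwars`)
hair_freq <- data.frame(
  hair = c("Brown", "Black", "Blond", "Red"),
  Freq = c(10, 6, 4, 2)
)

# Step 1: recode the old levels into new, coarser levels
hair_freq$hair <- factor(hair_freq$hair)
levels(hair_freq$hair) <- list(
  Dark  = c("Brown", "Black"),
  Light = c("Blond", "Red")
)

# Step 2: recoding leaves duplicate rows, so sum their frequencies
collapsed <- aggregate(Freq ~ hair, data = hair_freq, FUN = sum)
collapsed
```

Even in this small case, the recoding and the aggregation are separate steps that must be kept in sync by hand.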
Functionality of `collapse_levels()` is demonstrated below
using the `starwars` data from the
[`dplyr`](https://CRAN.R-project.org/package=dplyr) package. This dataset
contains case form data on various characters in the Star Wars franchise.
Variables considered in this vignette are a character's `hair_color`,
`skin_color`, and `eye_color`. Taken as is, this would correspond to an
$11 \times 28 \times 15$ contingency table... Time to collapse!
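
The dimensions of the implied contingency table are simply the numbers of distinct levels of each variable. The sketch below shows one way to count them in base R, using a small made-up data frame (`toy_case` is hypothetical and stands in for any case form data).

```r
# Toy case form data (hypothetical values)
toy_case <- data.frame(
  hair = c("brown", "black", "brown", "none"),
  eye  = c("blue", "blue", "brown", "red")
)

# Number of distinct levels per variable = dimensions of the implied table
n_levels <- vapply(toy_case, function(col) length(unique(col)), integer(1))
n_levels

# Number of cells in the implied contingency table
prod(n_levels)
```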
Here I load the `starwars` data and select the variables of interest. For
simplicity, I then remove rows containing `NA` values.
```{r overcoll_loadselect}
data("starwars", package = "dplyr")
star_case <- starwars |>
  dplyr::select(c("hair_color", "skin_color", "eye_color")) |>
  tidyr::drop_na()
str(star_case)
```
First, taking a look at the levels of variable `hair_color`, there are many
ways one might want to collapse these categories:
```{r overcoll_hairunique}
unique(star_case$hair_color)
```
***Example***:
Likely the most natural of these ways is the following:
1. Collapse different spellings of `"blond"` (i.e., `"blond"` and `"blonde"` become `Blonde`).
1. Collapse different shades of `"brown"` (i.e., `"brown"` and `"brown, grey"` become `Brown`).
1. Collapse different shades of `"auburn"` (i.e., `"auburn, white"`, `"auburn, grey"`, and `"auburn"` become `Auburn`).
1. Keep `"none"` as-is.
1. Keep `"white"` as-is.
1. Keep `"grey"` as-is.
1. Keep `"black"` as-is.
Here is how to do this using `collapse_levels()`:
```{r overcoll_ex1}
collapsed.star_case <- collapse_levels(
  star_case,            # The dataset
  hair_color = list(    # Assign the variable to be collapsed to a list
    # Format the list as NewLevel = c("old1", "old2", ..., "oldn")
    Blonde = c("blond", "blonde"),
    Brown = c("brown", "brown, grey"),
    Auburn = c("auburn, white", "auburn, grey", "auburn")
  )
)
str(collapsed.star_case)
unique(collapsed.star_case$hair_color)
```
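
For intuition about what a `NewLevel = c("old1", "old2")` specification means, here is a minimal base R sketch of the same kind of recoding on a plain character vector. It is only an illustration, not the implementation of `collapse_levels()`; the vector `hair` and the helper objects `spec` and `lookup` are made up.

```r
# Toy vector of levels (hypothetical, loosely modelled on `hair_color`)
hair <- c("blond", "blonde", "brown", "brown, grey", "none", "black")

# The same NewLevel = c("old1", "old2", ...) specification as a plain list
spec <- list(
  Blonde = c("blond", "blonde"),
  Brown  = c("brown", "brown, grey")
)

# Invert the list into a lookup vector: names are old levels, values are new ones
lookup <- setNames(rep(names(spec), lengths(spec)), unlist(spec, use.names = FALSE))

# Recode listed levels; levels absent from the specification are kept as-is
recoded <- ifelse(hair %in% names(lookup), lookup[hair], hair)
unique(recoded)
```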
Second, one might also want to collapse levels of variable `skin_color`:
```{r overcoll_skinunique}
unique(star_case$skin_color)
```
***Example***:
I decided to arbitrarily collapse these as follows:
1. Keep `"none"` as-is.
1. Keep `"unknown"` as-is.
1. `Shades`, comprising all levels that begin with `"white"`, `"grey"`, `"dark"`, `"light"`, or `"fair"`.
1. `Rainbows`, comprising all other levels.
Note that when working with real data, arbitrary decisions involving the
collapsing of variable levels are a *VERY* bad idea. Collapses should be
grounded in strong, data-driven justification. Arbitrary collapsing is employed
in this vignette purely for pedagogical and illustrative purposes.
```{r overcoll_ex2}
collapsed.star_case <- collapse_levels(
  collapsed.star_case,
  skin_color = list(
    Shades = c(
      "fair", "white", "light", "dark", "grey", "grey, red",
      "grey, blue", "white, blue", "grey, green, yellow", "fair, green, yellow"
    ),
    Rainbows = c(
      "green", "pale", "metal", "brown mottle", "brown", "mottled green",
      "orange", "blue, grey", "red", "blue", "yellow", "tan", "silver, red",
      "green, grey", "red, blue, white", "brown, white"
    )
  )
)
str(collapsed.star_case)
unique(collapsed.star_case$skin_color)
```
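
When a rule such as "all levels that begin with ..." defines a group, the level vectors can also be built programmatically rather than typed out. The following is a sketch under that assumption, using a short made-up vector of levels (`skin_levels` is hypothetical); the resulting character vectors could then be used wherever a `c("old1", "old2", ...)` vector is expected.

```r
# Toy vector of levels (hypothetical, loosely modelled on `skin_color`)
skin_levels <- c(
  "fair", "white, blue", "grey, red", "green",
  "blue, grey", "light", "none", "unknown"
)

# Levels that BEGIN with one of the shade words ("^" anchors at the start)
shade_pattern <- "^(white|grey|dark|light|fair)"
shades <- grep(shade_pattern, skin_levels, value = TRUE)

# Everything else, apart from the levels that are kept as-is
kept <- c("none", "unknown")
rainbows <- setdiff(skin_levels, c(shades, kept))

shades
rainbows
```

Note that `"blue, grey"` lands in `rainbows` because it contains, but does not begin with, `"grey"`.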
Third, one may also want to collapse levels of variable `eye_color`:
```{r overcoll_eyeunique}
unique(star_case$eye_color)
```
***Example***:
Again, I decided to arbitrarily collapse these as follows:
1. `Normal`, with levels of typical human eye colour (i.e., `"blue"`, `"blue-gray"`, `"brown"`, `"hazel"`, and `"dark"`).
1. `Abnormal`, with levels of eye colour that would be abnormal for humans (e.g., `"red"`, `"pink"`, `"gold"`, etc.).
1. Keep `"unknown"` as-is.
```{r overcoll_ex3}
collapsed.star_case <- collapse_levels(
  collapsed.star_case,
  eye_color = list(
    Normal = c("blue", "brown", "blue-gray", "hazel", "dark"),
    Abnormal = c(
      "yellow", "red", "orange", "black", "pink", "red, blue", "gold",
      "green, yellow", "white"
    )
  )
)
str(collapsed.star_case)
unique(collapsed.star_case$eye_color)
```
In addition, one may want (and is able) to collapse levels of multiple variables
in a single call to `collapse_levels()`.
***Example***:
To illustrate this (and to provide an easy working example for the following
"Tidy Conversions" section), the `collapsed.star_case` data is arbitrarily
collapsed as follows to correspond to a $3 \times 3 \times 3$ contingency table:
1. Variable `hair_color`:
    a. `Dark` corresponding to levels `"Brown"`, `"black"`, and `"Auburn"`.
    b. `Light` corresponding to levels `"Blonde"`, `"white"`, and `"grey"`.
    c. Keep `"none"` as-is.
1. Variable `skin_color`:
    a. `Other` corresponding to levels `"none"` and `"unknown"`.
    b. Keep `Rainbows` as-is.
    c. Keep `Shades` as-is.
1. Variable `eye_color` kept as-is.
```{r overcoll_ex4}
collapsed.star_case <- collapse_levels(
  collapsed.star_case,
  hair_color = list(    # First variable
    Dark = c("Brown", "black", "Auburn"),
    Light = c("Blonde", "white", "grey")
  ),
  skin_color = list(    # Second variable
    Other = c("none", "unknown")
  )
)
unique(collapsed.star_case$hair_color)
unique(collapsed.star_case$skin_color)
str(collapsed.star_case)
```
## Tidy Conversions
Until now, converting amongst forms of categorical data in R has been somewhat
onerous. As outlined in
[1. Creating and manipulating frequency tables](a1-creating.html),
the table below shows the typical process for converting among forms
(`A`, `B`, and `C` represent categorical variables, `X` represents an R data
object):
| **From this** | | **To this** | |
|:-----------------|:--------------------|:---------------------|-------------------|
| | _Case form_ | _Frequency form_ | _Table form_ |
| _Case form_ | noop | `xtabs(~A+B)` | `table(A,B)` |
| _Frequency form_ | `expand.dft(X)` | noop | `xtabs(count~A+B)`|
| _Table form_ | `expand.dft(X)` | `as.data.frame(X)` | noop |
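
As a rough illustration of the conventional route summarized in the table, the following base R sketch takes a small made-up case form data frame through table form, frequency form, and back again. All object names and values are hypothetical, and the row-replication step is only a base R stand-in for what `expand.dft()` does.

```r
# Toy case form data: one row per observation (hypothetical values)
case_df <- data.frame(
  A = c("a1", "a1", "a2", "a2", "a2"),
  B = c("b1", "b2", "b1", "b1", "b2")
)

# Case form -> table form
tab <- xtabs(~ A + B, data = case_df)

# Table form -> frequency form (columns A, B, and Freq)
freq_df <- as.data.frame(tab)

# Frequency form -> table form (the count column must be named in the formula)
tab_again <- xtabs(Freq ~ A + B, data = freq_df)

# Frequency form -> case form: repeat each row `Freq` times
case_again <- freq_df[rep(seq_len(nrow(freq_df)), freq_df$Freq), c("A", "B")]

nrow(case_again)  # one row per original observation
```

Each direction uses a different function and a different calling convention, which is the memory burden referred to above.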
Instead, one may simply use `as_table(X)` to convert to table form,
`as_freqform(X)` to convert to frequency form, and `as_caseform(X)` to convert
to case form. These are illustrated in the network (node/edge) diagram below:
Additionally, there are functions `as_array(X)` and `as_matrix(X)`
for converting to those respective types.
As with `collapse_levels()`, there is only one thing to keep in mind when using these functions:
when your object `X` is in frequency form, an argument of
`freq = "your frequency column name"` must be supplied. Beyond this, there is
no longer any need to memorize which function converts one form into
another.
The functionality of these "tidy" conversion functions is demonstrated below
using the `collapsed.star_case` data from the most recent example (i.e., the
data corresponding to a $3 \times 3 \times 3$ contingency table).
***Example***:
Convert the `collapsed.star_case` data into frequency form. Name this data
`star_freqform`.
```{r overconv-ex1}
star_freqform <- as_freqform(collapsed.star_case)
str(star_freqform)
```
Note that if one would like a data frame instead of a tibble, an argument of
`tidy = FALSE` needs to be provided. Naturally, this `tidy` argument is present
only in functions `as_freqform()` and `as_caseform()`.
***Example***:
Convert the `collapsed.star_case` data into a data frame in frequency form.
```{r overconv-ex2}
as_freqform(collapsed.star_case, tidy = FALSE) |> str()
```
***Example***:
Convert the frequency form data, `star_freqform`, into table form. Name this
data `star_tab`. Because we are converting *from* frequency form, the
`freq = "frequency column name"` argument must be supplied.
```{r overconv-ex3}
star_tab <- as_table(star_freqform, freq = "Freq")
str(star_tab)
```
***Example***:
Convert the table form data, `star_tab`, into an array. Name this
data `star_array`.
```{r overconv-ex4}
star_array <- as_array(star_tab)
class(star_array)
str(star_array)
```
To convert data with more than two dimensions to a matrix, one also needs to specify the row and column dimensions.
This is done using the `dims = c("dim1", "dim2", ..., "dim_n")` argument, which
works by summing over the dimensions excluded from this call. The first provided
dimension is taken as the row dimension, with the second dimension taken as the
column dimension.
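
"Summing over the excluded dimensions" can be pictured with base R's `apply()`, which accepts dimension names as margins when the array has named `dimnames`. The sketch below uses a made-up $2 \times 2 \times 2$ array (`arr` and its counts are hypothetical) and is meant only to illustrate the idea, not the internals of `as_matrix()`.

```r
# Toy 2 x 2 x 2 array with named dimensions (hypothetical counts 1 to 8)
arr <- array(
  1:8,
  dim = c(2, 2, 2),
  dimnames = list(
    hair = c("Dark", "Light"),
    skin = c("Other", "Shades"),
    eye  = c("Normal", "Abnormal")
  )
)

# Keep `hair` (rows) and `eye` (columns); sum over the excluded `skin` dimension
mat <- apply(arr, c("hair", "eye"), sum)
mat
```

The total count is unchanged by this operation; only the `skin` dimension has been summed out.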
***Example***:
Convert the array form data, `star_array`, into a matrix with dimensions
`"hair_color"` and `"eye_color"`. Name this data `star_mat`.
```{r overconv-ex5}
star_mat <- as_matrix(star_array, dims = c("hair_color", "eye_color"))
class(star_mat)
str(star_mat)
```
Note that the `dims` argument works the same way for all other tidy conversion
functions.
***Example***:
Convert the table form data, `star_tab`, into frequency form with dimensions
`"hair_color"` and `"eye_color"`.
```{r overconv-ex6}
as_freqform(star_tab, dims = c("hair_color", "eye_color")) |> str()
```
### Proportions
The last piece of these conversion functions is the `prop` argument, allowing
users to convert cells/frequencies to proportions. Calculated proportions may
either be relative to the grand total (`prop = TRUE`) or to one or more margins
(`prop = c("margin1", "margin2", ..., "margin_n")`).
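
The two kinds of proportions parallel what base R's `prop.table()` computes. The sketch below assumes that proportions relative to a margin behave like `prop.table()` with that margin; the matrix `mat` and its counts are made up for illustration.

```r
# Toy 2 x 2 table of counts (hypothetical values)
mat <- matrix(
  c(2, 6, 4, 8),
  nrow = 2,
  dimnames = list(hair = c("Dark", "Light"), eye = c("Normal", "Abnormal"))
)

# Relative to the grand total: all cells together sum to 1
grand <- prop.table(mat)

# Relative to the `hair` margin (dimension 1): each row sums to 1
by_hair <- prop.table(mat, margin = 1)

# Relative to both margins: each nonzero cell is divided by itself
both <- prop.table(mat, margin = c(1, 2))

sum(grand)
rowSums(by_hair)
both
```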
Note that `as_caseform()` is the only tidy conversion function without a
`prop` argument. In addition, `as_caseform()` will not convert proportional
data.^[This was a deliberate choice, as once proportions are relative to
margins, it becomes unclear how to convert these proportions back to
the original entries.]
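
To see why proportions relative to a margin cannot be mapped back to counts, consider this small base R sketch with two made-up tables (`t1` and `t2` are hypothetical). They contain different counts yet yield identical row proportions, so the proportions alone do not determine the original entries.

```r
# Two different tables of counts (hypothetical values)
t1 <- matrix(c(1, 3, 1, 3), nrow = 2)     # rows: (1, 1) and (3, 3)
t2 <- matrix(c(10, 6, 10, 6), nrow = 2)   # rows: (10, 10) and (6, 6)

# Row proportions for each table
p1 <- prop.table(t1, margin = 1)
p2 <- prop.table(t2, margin = 1)

# The proportions agree even though the counts do not
same_props <- isTRUE(all.equal(p1, p2))
same_counts <- isTRUE(all.equal(t1, t2))
c(same_props = same_props, same_counts = same_counts)
```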
***Example***:
Convert `star_mat` into a table of proportions that are relative to the grand
total.
```{r propconv-ex1}
star_mat # To view the original
as_table(star_mat, prop = TRUE)
```
***Example***:
Convert `star_mat` into a table of proportions that are relative to the marginal
sums of `hair_color`.
```{r propconv-ex2}
as_table(star_mat, prop = "hair_color")
```
***Example***:
Convert `star_mat` into a table of proportions that are relative to the marginal
sums of both `hair_color` and `eye_color`. Since these are the only two
dimensions, cell proportions will all be equal to $1.0$ (except for cells where
no data exists).
```{r propconv-ex3}
as_table(star_mat, prop = c("hair_color", "eye_color"))
```
# Taken Together
Taking `collapse_levels()` and the tidy conversion functions together, one now
has an intuitive framework for manipulating categorical data.
***Example***:
The `starwars` data also has a variable named `homeworld`, specifying the planet
that a given character was from. The code below does the following:
1. Create data `home_star` from dataset `starwars`. The new data includes both `homeworld` and the previous variables of interest (`hair_color`, `eye_color`, and `skin_color`). Missing values are then omitted.
1. Sort the unique levels of `homeworld` alphabetically.
1. Collapse the first half of the sorted levels into a level named `abc`.
1. Collapse the second half of the sorted levels into a level named `xyz`.
1. Collapse `eye_color` according to the previous `Abnormal`, `Normal`, and `"unknown"` conventions.
1. Convert the collapsed data into a table with dimensions `homeworld` and
`eye_color`. Call this table `tab.home_star` and plot the result in a mosaic
display.
1. Convert `tab.home_star` into a matrix of proportions (relative to the grand total).
```{r tt-ex1}
home_star <- starwars |>
  dplyr::select(c("hair_color", "skin_color", "eye_color", "homeworld")) |>
  tidyr::drop_na()
# Sort unique levels of homeworld
lvls <- home_star$homeworld |> unique() |> sort()
lvls
# Collapse variable levels
collapsed.home_star <- collapse_levels(
  home_star,
  homeworld = list(
    # Integer division keeps the split index whole when the number of levels is odd
    abc = lvls[1:(length(lvls) %/% 2)],
    xyz = lvls[(length(lvls) %/% 2 + 1):length(lvls)]
  ),
  eye_color = list(
    Normal = c("blue", "brown", "blue-gray", "hazel", "dark"),
    Abnormal = c(
      "yellow", "red", "orange", "black", "pink", "red, blue", "gold",
      "green, yellow", "white"
    )
  )
)
# Convert to table of dimensions 'homeworld' and 'eye_color'
tab.home_star <- as_table(collapsed.home_star, dims = c("homeworld", "eye_color"))
# Plot as mosaic display
mosaic(tab.home_star, shading = TRUE, gp = shading_Friendly)
# Convert table into matrix of proportions. Note argument 'dims' was not supplied
# as we already know that there are exactly 2 dimensions.
as_matrix(tab.home_star, prop = TRUE)
```
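
Splitting a vector of levels in half deserves a little care when the number of levels is odd, because a fractional index such as `4.5` is silently truncated by R. The sketch below shows one way to split with integer division so that every level lands in exactly one half; the vector `lvls_toy` is a short made-up example.

```r
# Toy sorted vector with an odd number of levels (hypothetical subset)
lvls_toy <- c("Alderaan", "Corellia", "Endor", "Naboo", "Tatooine")

# Integer division gives a whole-number split point
half <- length(lvls_toy) %/% 2

first_half  <- lvls_toy[seq_len(half)]
second_half <- lvls_toy[(half + 1):length(lvls_toy)]

first_half
second_half
```

With an odd count, the extra level goes to the second half, and no level is left out.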
Thus, this constitutes a pipeline for working with categorical data:
1. Gather data and clean it.
1. Collapse levels when substantively necessary.
1. Convert forms, select dimensions, and/or take proportions if necessary.
```{r ttpipeline, eval=FALSE}
dataset |>                             # Gather the data
  select(...) |> drop_na() |> ... |>   # Clean the data
  collapse_levels(...) |>              # Collapse levels as necessary
  as_form(...)                         # Convert forms, select dimensions, take proportions
                                       # (as_form stands in for as_table(), as_freqform(), etc.)
```
When viewed this way, these functions appear to be the start of a grammar of
categorical data analysis.