Frequency tables need some Tidy Love ❤️
Tidy methods for quantitative data & models have advanced considerably, but there hasn’t been much development of similar ideas for “categorical data”, by which I mean data that is often compactly represented as n-way frequency tables, cross classified by one or more discrete factors.
What would it take to implement a tidy framework for such data? These
notes are, in effect, a call for participation in developing a
tidyCat
package for this purpose. Other possible names for
this: tidyCDA
, tidyfreq
…
I see three areas that could be developed here:
Current non-tidy data forms and operations, following (Friendly & Meyer, 2016) are described in the vignette Creating and manipulating frequency tables
It seems clear that the most flexible and general form, and one that most closely matches a tidy data frame is case form, because this allows for numeric variables as well. Thus:
Among these, tidy categorical data should best be represented as a
tibble
in case form. A tibble is convenient for its’ printing.
The methods xtabs()
, table()
and
expand.dft()
described in that vignette allow conversion
from one form to another. Tidy equivalents might be:
as_table()
, as_matrix()
,
as_array()
to convert from any form to table/array
form
Similarly, perhaps as_caseform()
,
as_freqform()
to convert to those. There is already
as.data.frame(table)
to convert to frequency form, and
expand.dft()
converts that to case form
data("HairEyeColor")
hec.df <- as.data.frame(HairEyeColor)
head(hec.df)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
# expand to case form
expand.dft(hec.df) |> head()
## Hair Eye Sex
## 1 Black Brown Male
## 2 Black Brown Male
## 3 Black Brown Male
## 4 Black Brown Male
## 5 Black Brown Male
## 6 Black Brown Male
vcd::structable()
produces a ‘flat’ representation of a
high-dimensional contingency table constructed by recursive splits
(similar to the construction of mosaic displays). One can be constructed
from a table or from a data frame with a formula method,structable(Titanic)
## Sex Male Female
## Survived No Yes No Yes
## Class Age
## 1st Child 0 5 0 1
## Adult 118 57 4 140
## 2nd Child 0 11 0 13
## Adult 154 14 13 80
## 3rd Child 35 13 17 14
## Adult 387 75 89 76
## Crew Child 0 0 0 0
## Adult 670 192 3 20
structable(Sex + Class ~ Survived + Age, data = Titanic)
## Sex Male Female
## Class 1st 2nd 3rd Crew 1st 2nd 3rd Crew
## Survived Age
## No Child 0 0 35 0 0 0 17 0
## Adult 118 154 387 670 4 13 89 3
## Yes Child 5 11 13 0 1 13 14 0
## Adult 57 14 75 192 140 80 76 20
and there are a suite of methods for indexing and selecting parts of an n-way table.
The methods in the plyr
package (now retired)
provided a coherent set of tools for a split-apply-combine strategy that
works nicely with multidimensional arrays. Perhaps there are some useful
ideas for frequency tables that could be resurrected here.
There is also a role for purrr
methods and thinking
here: n-way tables as nested
lists/arrays? The ideas of mapping over these?
Also needed:
methods for recoding and collapsing the levels
of a factor: forcats::fct_recode()
,
forcats::fct_collapse()
,
forcats::fct_lump_min()
are useful here.
methods for reordering the levels of a factor, either manually or for some analysis purpose. For example, Data from Glass (1954) gave this 5 x 5 table on the occupations of 3500 British fathers and their sons, where the occupational categories are listed in alphabetic order.
data(Glass, package="vcdExtra")
str(Glass)
## 'data.frame': 25 obs. of 3 variables:
## $ father: Factor w/ 5 levels "Managerial","Professional",..: 2 2 2 2 2 1 1 1 1 1 ...
## $ son : Factor w/ 5 levels "Managerial","Professional",..: 2 1 4 3 5 2 1 4 3 5 ...
## $ Freq : int 50 45 8 18 8 28 174 84 154 55 ...
(glass.tab <- xtabs(Freq ~ father + son, data=Glass))
## son
## father Managerial Professional Skilled Supervisory Unskilled
## Managerial 174 28 154 84 55
## Professional 45 50 18 8 8
## Skilled 150 14 714 185 447
## Supervisory 78 11 223 110 96
## Unskilled 42 3 320 72 411
This can be reordered manually by indexing, to arrange the categories
by status, giving an order Professional
down to Unskilled
:
# reorder by status
ord <- c(2, 1, 4, 3, 5)
glass.tab[ord, ord]
## son
## father Professional Managerial Supervisory Skilled Unskilled
## Professional 50 45 8 18 8
## Managerial 28 174 84 154 55
## Supervisory 11 78 110 223 96
## Skilled 14 150 185 714 447
## Unskilled 3 42 72 320 411
A more general method is to permute the row and column categories in
the order implied by correspondence analysis dimensions. This is
implemented in the seriation
package using the CA
method of
seriation::seriate()
and applying permute()
to
the result.
library(seriation)
order <- seriate(glass.tab, method = "CA")
# the permuted row and column labels
rownames(glass.tab)[order[[1]]]
## [1] "Professional" "Managerial" "Supervisory" "Skilled" "Unskilled"
# reorder rows and columns
permute(glass.tab, order)
## son
## father Professional Managerial Supervisory Skilled Unskilled
## Professional 50 45 8 18 8
## Managerial 28 174 84 154 55
## Supervisory 11 78 110 223 96
## Skilled 14 150 185 714 447
## Unskilled 3 42 72 320 411
What are tidy ways to do these things?
The standard analysis of frequency data is in the form of loglinear
models fit by MASS::loglm()
or with more flexible versions
fit with glm()
or gnm::gnm()
. These are
essentially linear models for the log of frequency. In
vcdExtra
, there are several methods for a list of such
models, of class "glmlist"
and "loglmlist"
,
and these should be accommodated in tidy methods.
What is needed are broom
methods for loglm
models. The information required is accessible from standard functions,
but not in a tidy form.
glance.loglm()
– model level
statistics. These are given in the output of the print()
method, and available from the print()
method. A
complication is both LR and Pearson χ2 are reported, so these
would need to be made to appear in separate columns. There are also
related LRstats()
functions in vcdExtra
, which
report AIC
and BIC
.hec.indep <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.indep
## Call:
## loglm(formula = ~Hair + Eye + Sex, data = HairEyeColor)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 166.3001 24 0
## Pearson 164.9247 24 0
# extract test statistics
summary(hec.indep)$tests
## X^2 df P(> X^2)
## Likelihood Ratio 166.3001 24 0
## Pearson 164.9247 24 0
LRstats(hec.indep)
## Likelihood summary table:
## AIC BIC LR Chisq Df Pr(>Chisq)
## hec.indep 321.18 332.9 166.3 24 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy.loglm()
— coefficient level
statistics. These are available from coef.loglm()
. They
would need to assembled into a long format. Standard errors &
p-values might be a problem.coef(hec.indep)
## $`(Intercept)`
## [1] 2.646879
##
## $Hair
## Black Brown Red Blond
## -0.17911627 0.79474431 -0.59856762 -0.01706041
##
## $Eye
## Brown Blue Hazel Green
## 0.5296905 0.5067010 -0.3313375 -0.7050540
##
## $Sex
## Male Female
## -0.0574957 0.0574957
augment.loglm()
— should give case
level statistics: fitted values, residuals, …fitted(hec.indep)
## Re-fitting to get fitted values
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 18.91504 18.48515 7.995903 5.502557
## Brown 50.08982 48.95142 21.174335 14.571585
## Red 12.43489 12.15228 5.256566 3.617421
## Blond 22.24268 21.73717 9.402589 6.470599
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 21.22010 20.73782 8.970314 6.173119
## Brown 56.19396 54.91682 23.754719 16.347334
## Red 13.95025 13.63320 5.897151 4.058254
## Blond 24.95326 24.38614 10.548424 7.259131
residuals(hec.indep)
## Re-fitting to get frequencies and fitted values
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 2.7349421 -1.8843321 0.6818522 -1.1685504
## Brown 0.4073036 0.1493413 0.8080654 0.1116873
## Red -0.7150915 -0.6371217 0.7233138 1.5738248
## Blond -5.1444108 1.6747424 -1.5778804 0.5796238
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 2.9150045 -2.9069608 -1.4476878 -1.9590840
## Brown 1.2726041 -3.0381608 1.0398490 -0.5953638
## Red 0.5361173 -1.9834384 0.4409912 1.3223880
## Blond -5.2211917 6.6539759 -1.9056398 0.2704898
What about hatvalues
? Not implemented, but shouldn’t be
too hard.
The most common graphical methods are those implemented in
vcd
: mosaic()
association plots
(assoc()
), …, which rely on vcd::strucplot()
described in (Meyer, Zeileis, & Hornik,
2006).
Is there a tidy analog that might work with ggplot2
? The
ggmosaic
package implements basic marimeko-style mosaic plots. They are not
very general, in that they cannot do residual-based shading to show the
patterns of association.
However, they are based on a productplots package by Hadley, which seems to provide some basic structure for constructing such displays of nested rectangles.