Categorical Association Measures

The moderncor_cat() function provides a unified interface for computing association measures between categorical (factor) variables. All measures require the DescTools package.

Basic Usage

moderncor_cat() accepts two factor vectors. Character and numeric vectors are also accepted and are treated as categorical:

set.seed(42)
x <- factor(sample(c("A", "B", "C"), 100, replace = TRUE))
y <- factor(sample(c("X", "Y"), 100, replace = TRUE))

moderncor_cat(x, y, method = "cramers_v")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100

The output is an S3 object of class "moderncor_cat" with the same structure as moderncor() output:

$estimate: the association coefficient
$statistic: the chi-square test statistic (for nominal methods)
$p.value: the p-value (for nominal methods; NULL for ordinal methods)
$n: the sample size
$method_label: human-readable method name

Querying Available Methods

available_methods_cat()
#>        method                   label   package    type
#> 1   cramers_v              Cramer's V DescTools nominal
#> 2         phi         Phi Coefficient DescTools nominal
#> 3       gamma   Goodman-Kruskal Gamma DescTools ordinal
#> 4    somers_d               Somers' D DescTools ordinal
#> 5 contingency Contingency Coefficient DescTools nominal
#> 6   tschuprow           Tschuprow's T DescTools nominal

Methods fall into two categories:

Nominal: for unordered categories (Cramér’s V, Phi, Contingency Coefficient, Tschuprow’s T)
Ordinal: for ordered categories (Goodman-Kruskal Gamma, Somers’ D)

Nominal Association Measures

Nominal measures are appropriate when categories have no natural ordering. They are all based on the chi-square statistic and return a p-value.

Cramér’s V

Cramér’s V is the most widely used measure of nominal association. It ranges from 0 (no association) to 1 (perfect association) and is symmetric:

moderncor_cat(x, y, method = "cramers_v")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100

For a 2×2 table, Cramér’s V equals the absolute value of the Phi coefficient.
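The relationship between the printed statistic and the estimate can be sketched in base R. This is a minimal illustration of the standard formula $V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$, not the package's internal code; the helper name cramers_v_manual() and the toy vectors are made up for this example.

```r
# Base-R sketch of Cramér's V (hypothetical helper, not part of moderncor).
cramers_v_manual <- function(x, y) {
  tab  <- table(x, y)                 # contingency table; NAs are dropped
  # Pearson chi-square without continuity correction
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))   # smaller table dimension
  sqrt(chi2 / (n * (k - 1)))
}

# Toy data in which each level of x maps to exactly one level of y
x_toy <- factor(c("A", "B", "B", "C"))
y_toy <- factor(c("X", "Y", "Y", "X"))

v_toy <- cramers_v_manual(x_toy, y_toy)
v_toy   # perfect association, so V is 1
```

For a 3×2 table such as the one above, $\min(r-1, c-1) = 1$, so V reduces to $\sqrt{\chi^2 / n}$, which agrees with the printed values ($\sqrt{0.03 / 100} \approx 0.0173$).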

Phi Coefficient

The Phi coefficient is designed for 2×2 contingency tables. For larger tables it can exceed 1, so prefer Cramér’s V in that case:

x_bin <- factor(sample(c("Yes", "No"), 100, replace = TRUE))
y_bin <- factor(sample(c("Pass", "Fail"), 100, replace = TRUE))

moderncor_cat(x_bin, y_bin, method = "phi")
#> 
#>    Phi Coefficient 
#> 
#>   Estimate:  0.1386
#>   Statistic: 1.9218
#>   P-value:   0.1657
#>   Sample size (n): 100
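The claim that Phi can exceed 1 on larger tables follows from its definition as $\sqrt{\chi^2 / n}$ (in the output above, $\sqrt{1.9218 / 100} \approx 0.1386$). The base-R sketch below illustrates this with two perfectly associated toy tables; phi_manual() is a hypothetical helper written for this example and computes the unsigned coefficient only.

```r
# Base-R sketch of the (unsigned) Phi coefficient; hypothetical helper.
phi_manual <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  sqrt(chi2 / sum(tab))               # sqrt(chi-square / n)
}

# 2x2 table with perfect association: Phi reaches its maximum of 1
phi_2x2 <- phi_manual(c("Yes", "Yes", "No", "No"),
                      c("Pass", "Pass", "Fail", "Fail"))

# 3x3 table with perfect association: chi-square = 2n, so Phi = sqrt(2)
lv      <- rep(c("a", "b", "c"), each = 2)
phi_3x3 <- phi_manual(lv, lv)

c(phi_2x2 = phi_2x2, phi_3x3 = phi_3x3)
```

Cramér's V divides by the extra factor $\min(r-1, c-1)$ inside the square root, which is what brings the 3×3 case back to 1.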

Contingency Coefficient

The contingency coefficient (Pearson’s C) is bounded between 0 and $\sqrt{(k-1)/k}$, where $k$ is the smaller of the number of rows and columns in the contingency table. Because the upper bound depends on $k$, values are not directly comparable across tables of different sizes:

moderncor_cat(x, y, method = "contingency")
#> 
#>    Contingency Coefficient 
#> 
#>   Estimate:  0.0173
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100
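The table-size-dependent upper bound can be checked with a short base-R sketch. It uses the standard formula $C = \sqrt{\chi^2 / (\chi^2 + n)}$ on two perfectly associated toy tables; cont_coef_manual() is a hypothetical helper for this illustration and not a moderncor function.

```r
# Base-R sketch of Pearson's contingency coefficient; hypothetical helper.
cont_coef_manual <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  sqrt(chi2 / (chi2 + sum(tab)))
}

# Perfect association in a 2x2 table: C equals sqrt(1/2), not 1
lv2   <- rep(c("a", "b"), each = 2)
c_2x2 <- cont_coef_manual(lv2, lv2)

# Perfect association in a 3x3 table: C equals sqrt(2/3)
lv3   <- rep(c("a", "b", "c"), each = 2)
c_3x3 <- cont_coef_manual(lv3, lv3)

c(c_2x2 = c_2x2, c_3x3 = c_3x3)
```

Both tables show the strongest possible association, yet the coefficient differs, which is why C should not be compared across table sizes.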

Tschuprow’s T

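Tschuprow’s T is similar to Cramér’s V but normalizes by the geometric mean of the table's degrees of freedom, $\sqrt{(r-1)(c-1)}$ for $r$ rows and $c$ columns, where Cramér’s V uses $\min(r-1, c-1)$. It is symmetric and ranges from 0 to 1, although it can reach 1 only for square tables: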

moderncor_cat(x, y, method = "tschuprow")
#> 
#>    Tschuprow's T 
#> 
#>   Estimate:  0.0146
#>   Statistic: 0.03
#>   P-value:   0.9851
#>   Sample size (n): 100
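The two normalizers can be compared directly. Writing $r$ and $c$ for the number of rows and columns, the standard textbook definitions (stated here for illustration, not taken from the package source) are:

$$
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\, c-1)}}, \qquad
T = \sqrt{\frac{\chi^2}{n \, \sqrt{(r-1)(c-1)}}}
$$

For the 3×2 table above, $\min(r-1, c-1) = 1$ and $\sqrt{(r-1)(c-1)} = \sqrt{2}$, so $T = V / 2^{1/4} \approx 0.841\,V$. With the rounded values printed above ($\chi^2 \approx 0.03$, $n = 100$) this gives $T \approx 0.0146$, in line with the reported estimate. For square tables the two normalizers coincide and $T = V$.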

Ordinal Association Measures

Ordinal measures are appropriate when categories have a natural ordering (e.g., Likert scales, severity grades). They do not return p-values by default.

Goodman-Kruskal Gamma

Goodman-Kruskal Gamma ($\gamma$) compares concordant pairs of observations (one observation ranks higher than the other on both variables) with discordant pairs (higher on one variable, lower on the other), ignoring tied pairs. It ranges from −1 to 1 and is symmetric:

# Simulate ordinal survey data
set.seed(1)
quality  <- factor(sample(c("Low", "Medium", "High"), 100, replace = TRUE,
                           prob = c(0.3, 0.4, 0.3)),
                   levels = c("Low", "Medium", "High"), ordered = TRUE)
satisfaction <- factor(sample(c("Dissatisfied", "Neutral", "Satisfied"), 100,
                               replace = TRUE, prob = c(0.3, 0.4, 0.3)),
                       levels = c("Dissatisfied", "Neutral", "Satisfied"), ordered = TRUE)

moderncor_cat(quality, satisfaction, method = "gamma")
#> 
#>    Goodman-Kruskal Gamma 
#> 
#>   Estimate:  0.0808
#>   Sample size (n): 100
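The pair counting behind Gamma can be sketched in base R as $\gamma = (C - D) / (C + D)$, where $C$ and $D$ are the numbers of concordant and discordant pairs. The helper gamma_manual() and the five-observation toy data are assumptions made for this illustration; the quadratic pair enumeration is meant for small samples only.

```r
# Base-R sketch of Goodman-Kruskal Gamma; hypothetical helper.
gamma_manual <- function(x, y) {
  xi <- as.integer(x)                      # ordered factor -> level ranks
  yi <- as.integer(y)
  dx <- sign(outer(xi, xi, "-"))           # pairwise ordering on x
  dy <- sign(outer(yi, yi, "-"))           # pairwise ordering on y
  s  <- (dx * dy)[upper.tri(dx)]           # each unordered pair once
  n_conc <- sum(s > 0)                     # same direction on both variables
  n_disc <- sum(s < 0)                     # opposite directions
  (n_conc - n_disc) / (n_conc + n_disc)    # tied pairs (s == 0) are ignored
}

q_toy <- factor(c("Low", "Low", "Medium", "High", "High"),
                levels = c("Low", "Medium", "High"), ordered = TRUE)
s_toy <- factor(c("Dissatisfied", "Neutral", "Neutral", "Satisfied", "Dissatisfied"),
                levels = c("Dissatisfied", "Neutral", "Satisfied"), ordered = TRUE)

g_toy <- gamma_manual(q_toy, s_toy)
g_toy   # 4 concordant and 2 discordant pairs: (4 - 2) / (4 + 2)
```

Because concordance and discordance do not depend on which variable is called x, swapping the arguments leaves the result unchanged.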

Somers’ D

Somers’ D is an asymmetric ordinal measure: it measures the predictability of y from x (but not vice versa). Values range from −1 to 1:

moderncor_cat(quality, satisfaction, method = "somers_d")
#> 
#>    Somers' D 
#> 
#>   Estimate:  0.0548
#>   Sample size (n): 100

Note that swapping x and y gives a different result:

moderncor_cat(satisfaction, quality, method = "somers_d")
#> 
#>    Somers' D 
#> 
#>   Estimate:  0.0549
#>   Sample size (n): 100
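The asymmetry comes from the denominator. In the usual definition, Somers' D of y given x divides the concordant-minus-discordant count by the number of pairs that differ on x, so pairs tied on y still count against the prediction. The base-R sketch below assumes that definition and treats the first argument as the predictor; somers_d_manual() and the toy data are hypothetical and not part of the package.

```r
# Base-R sketch of Somers' D(y | x); hypothetical helper, x is the predictor.
somers_d_manual <- function(x, y) {
  xi <- as.integer(x)
  yi <- as.integer(y)
  mx <- sign(outer(xi, xi, "-"))
  my <- sign(outer(yi, yi, "-"))
  up <- upper.tri(mx)                      # each unordered pair once
  dx <- mx[up]
  dy <- my[up]
  n_conc     <- sum(dx * dy > 0)
  n_disc     <- sum(dx * dy < 0)
  n_untied_x <- sum(dx != 0)               # pairs that differ on the predictor
  (n_conc - n_disc) / n_untied_x
}

dose <- factor(c("Low", "Low", "High", "High"),
               levels = c("Low", "High"), ordered = TRUE)
resp <- factor(c("None", "Partial", "Partial", "Full"),
               levels = c("None", "Partial", "Full"), ordered = TRUE)

d_yx <- somers_d_manual(dose, resp)   # 3 concordant / 4 pairs untied on dose
d_xy <- somers_d_manual(resp, dose)   # 3 concordant / 5 pairs untied on resp
c(d_yx = d_yx, d_xy = d_xy)
```

With large samples and similar marginal distributions the two directions can be close, as in the output above (0.0548 versus 0.0549), but they are not equal in general.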

Pairwise Matrix for Multiple Variables

Pass a data.frame of factor columns to compute the association for every pair of columns:

df <- data.frame(
  cyl   = factor(mtcars$cyl),
  gear  = factor(mtcars$gear),
  am    = factor(mtcars$am)
)

res_mat <- moderncor_cat(df, method = "cramers_v")
res_mat
#> 
#>    Cramer's V 
#> 
#>   Association Matrix (n = 32):
#> 
#>         cyl   gear     am
#> cyl  1.0000 0.5309 0.5226
#> gear 0.5309 1.0000 0.8090
#> am   0.5226 0.8090 1.0000
#> 
#>   P-value Matrix:
#> 
#>         cyl   gear     am
#> cyl  0.0000 0.0012 0.0126
#> gear 0.0012 0.0000 0.0000
#> am   0.0126 0.0000 0.0000

The result contains the matrix of association coefficients. For nominal methods, the matching p-value matrix is also stored in $p.value:

res_mat$p.value
#>              cyl         gear           am
#> cyl  0.000000000 1.214066e-03 1.264661e-02
#> gear 0.001214066 0.000000e+00 2.830889e-05
#> am   0.012646605 2.830889e-05 0.000000e+00

Use as.data.frame() to convert the result to a tidy (long) format. Each pair of variables appears twice, once in each order:

as.data.frame(res_mat)
#>   var1 var2 association      p.value
#> 1 gear  cyl   0.5308655 1.214066e-03
#> 2   am  cyl   0.5226355 1.264661e-02
#> 3  cyl gear   0.5308655 1.214066e-03
#> 4   am gear   0.8090247 2.830889e-05
#> 5  cyl   am   0.5226355 1.264661e-02
#> 6 gear   am   0.8090247 2.830889e-05
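For readers who want to see what the pairwise computation amounts to, the association matrix shown earlier in this section can be approximated in base R by looping over column pairs. This is only a sketch under the assumption that each entry is Cramér's V from an uncorrected Pearson chi-square test; cramers_v() here is a local hypothetical helper, not the package's implementation.

```r
# Local helper: Cramér's V from an uncorrected chi-square test.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

df <- data.frame(
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  am   = factor(mtcars$am)
)

vars  <- names(df)
v_mat <- diag(1, length(vars))            # a variable with itself: V = 1
dimnames(v_mat) <- list(vars, vars)

# Fill the upper triangle and mirror it, since V is symmetric
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i < j) {
      v_mat[i, j] <- v_mat[j, i] <- cramers_v(df[[i]], df[[j]])
    }
  }
}

round(v_mat, 4)
```

The off-diagonal entries should agree with the matrix printed by moderncor_cat() above to the displayed precision.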

Handling Missing Values

The use argument controls how missing values are handled, mirroring the interface of moderncor():

"complete.obs" (default): remove all rows with any NA before computing
"pairwise.complete.obs": for each pair of variables, remove only the rows where either variable in that pair is NA
"everything": propagate NAs (returns NA for any pair with missing values)

x_na <- factor(c("A", "B", NA, "A", "B", "C"))
y_na <- factor(c("X", "Y", "X", NA, "Y", "X"))

moderncor_cat(x_na, y_na, method = "cramers_v", use = "complete.obs")
#> 
#>    Cramer's V 
#> 
#>   Estimate:  1
#>   Statistic: 4
#>   P-value:   0.1353
#>   Sample size (n): 4
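The difference between the first two options only shows up with three or more variables. The base-R sketch below illustrates the row selection each option implies, based on the descriptions above rather than on the package internals; the data frame df_na and its column names are invented for this example.

```r
# Toy data: each column has one missing value, in a different row
df_na <- data.frame(
  x1 = factor(c("A", "B", NA,  "A", "B", "C")),
  x2 = factor(c("X", "Y", "X", NA,  "Y", "X")),
  x3 = factor(c("P", NA,  "Q", "P", "Q", "P"))
)

# "complete.obs": one shared set of rows, complete across all columns
keep_all   <- complete.cases(df_na)
n_complete <- sum(keep_all)                 # rows 1, 5, 6

# "pairwise.complete.obs": rows are selected separately for each pair
pairs  <- combn(names(df_na), 2, simplify = FALSE)
n_pair <- vapply(pairs, function(p) sum(complete.cases(df_na[p])), integer(1))
names(n_pair) <- vapply(pairs, paste, character(1), collapse = "~")

n_complete
n_pair
```

Pairwise deletion keeps more observations per pair (4 instead of 3 here), at the cost of each coefficient being based on a different subset of rows.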

Choosing the Right Method

Situation	Recommended method
Two unordered categorical variables (general)	`cramers_v`
Two binary variables (2×2 table)	`phi`
Two ordered categorical (Likert) variables	`gamma`
Predicting one ordered variable from another	`somers_d`
Comparing association across different table sizes	`cramers_v` or `tschuprow`

For continuous variables, use moderncor() instead. See vignette("introduction") for a full overview.
