Getting started with spicy

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

spicy is an R package for descriptive statistics and data analysis, designed for data science and survey research workflows. It covers variable inspection, frequency tables, cross-tabulations with chi-squared tests and effect sizes, and publication-ready summary tables, offering functionality similar to Stata or SPSS but within a tidyverse-friendly R environment. This vignette walks through the core workflow using the bundled sochealth dataset, a simulated social-health survey with 1200 respondents and 24 variables.

Inspect your data

varlist() (or its shortcut vl()) gives a compact overview of every variable in a data frame: name, label, representative values, class, number of distinct values, valid observations, and missing values. In RStudio or Positron, calling varlist() without arguments opens an interactive viewer - this is the most common usage in practice. Here we use tbl = TRUE to produce static output for the vignette:

varlist(sochealth, tbl = TRUE)
#> # A tibble: 24 × 7
#>    Variable          Label                 Values Class N_distinct N_valid   NAs
#>    <chr>             <chr>                 <chr>  <chr>      <int>   <int> <int>
#>  1 sex               Sex                   Femal… fact…          2    1200     0
#>  2 age               Age (years)           25, 2… nume…         51    1200     0
#>  3 age_group         Age group             25-34… orde…          4    1200     0
#>  4 education         Highest education le… Lower… orde…          3    1200     0
#>  5 social_class      Subjective social cl… Lower… orde…          5    1200     0
#>  6 region            Region of residence   Centr… fact…          6    1200     0
#>  7 employment_status Employment status     Emplo… fact…          4    1200     0
#>  8 income_group      Household income gro… Low, … orde…          4    1182    18
#>  9 income            Monthly household in… 1000,… nume…       1052    1200     0
#> 10 smoking           Current smoker        No, Y… fact…          2    1175    25
#> # ℹ 14 more rows

You can also select specific columns with tidyselect syntax:

varlist(sochealth, starts_with("bmi"), income, weight, tbl = TRUE)
#> # A tibble: 4 × 7
#>   Variable     Label                       Values Class N_distinct N_valid   NAs
#>   <chr>        <chr>                       <chr>  <chr>      <int>   <int> <int>
#> 1 bmi          Body mass index             16, 1… nume…        177    1188    12
#> 2 bmi_category BMI category                Norma… orde…          3    1188    12
#> 3 income       Monthly household income (… 1000,… nume…       1052    1200     0
#> 4 weight       Survey design weight        0.294… nume…        794    1200     0

Frequency tables

freq() produces frequency tables with counts, percentages, and (optionally) valid and cumulative percentages.

freq(sochealth, education)
#> Frequency table: education
#> 
#>  Category   │ Values               Freq.    Percent 
#> ────────────┼───────────────────────────────────────
#>  Valid      │ Lower secondary        261       21.8 
#>             │ Upper secondary        539       44.9 
#>             │ Tertiary               400       33.3 
#> ────────────┼───────────────────────────────────────
#>  Total      │                       1200      100.0 
#> 
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth

Weighted frequencies use the weights argument. With rescale = TRUE, the total weighted N matches the unweighted N:

freq(sochealth, education, weights = weight, rescale = TRUE)
#> Frequency table: education
#> 
#>  Category   │ Values               Freq.    Percent 
#> ────────────┼───────────────────────────────────────
#>  Valid      │ Lower secondary        259       21.6 
#>             │ Upper secondary        546       45.5 
#>             │ Tertiary               395       32.9 
#> ────────────┼───────────────────────────────────────
#>  Total      │                       1200      100.0 
#> 
#> Label: Highest education level
#> Class: ordered, factor
#> Data: sochealth
#> Weight: weight (rescaled)

Cross-tabulations

cross_tab() crosses two categorical variables. By default it shows counts, a chi-squared test, and Cramer’s V:

cross_tab(sochealth, smoking, education)
#> Crosstable: smoking x education (N)
#> 
#>  Values   │   Lower secondary    Upper secondary    Tertiary │   Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No       │               179                415         332 │     926 
#>  Yes      │                78                112          59 │     249 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total    │               257                527         391 │    1175 
#> 
#> Chi-2(2) = 21.6, p <.001
#> Cramer's V = 0.14

Add percentages with percent:

cross_tab(sochealth, smoking, education, percent = "col")
#> Crosstable: smoking x education (Column %)
#> 
#>  Values   │   Lower secondary    Upper secondary    Tertiary │   Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No       │              69.6               78.7        84.9 │    78.8 
#>  Yes      │              30.4               21.3        15.1 │    21.2 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total    │             100.0              100.0       100.0 │   100.0 
#>  N        │               257                527         391 │    1175 
#> 
#> Chi-2(2) = 21.6, p <.001
#> Cramer's V = 0.14

Group by a third variable with by:

cross_tab(sochealth, smoking, education, by = sex)
#> Crosstable: smoking x education (N) | sex = Female
#> 
#>  Values   │   Lower secondary    Upper secondary    Tertiary │   Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No       │                95                220         160 │     475 
#>  Yes      │                38                 62          31 │     131 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total    │               133                282         191 │     606 
#> 
#> Chi-2(2) = 7.1, p = .029
#> Cramer's V = 0.11
#> 
#> Crosstable: smoking x education (N) | sex = Male
#> 
#>  Values   │   Lower secondary    Upper secondary    Tertiary │   Total 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  No       │                84                195         172 │     451 
#>  Yes      │                40                 50          28 │     118 
#> ──────────┼──────────────────────────────────────────────────┼─────────
#>  Total    │               124                245         200 │     569 
#> 
#> Chi-2(2) = 15.6, p <.001
#> Cramer's V = 0.17

When both variables are ordered factors, cross_tab() automatically selects an ordinal measure (Kendall’s Tau-b) instead of Cramer’s V:

cross_tab(sochealth, self_rated_health, education)
#> Crosstable: self_rated_health x education (N)
#> 
#>  Values      │   Lower secondary    Upper secondary    Tertiary │   Total 
#> ─────────────┼──────────────────────────────────────────────────┼─────────
#>  Poor        │                28                 28           5 │      61 
#>  Fair        │                86                118          62 │     266 
#>  Good        │               102                263         193 │     558 
#>  Very good   │                44                118         133 │     295 
#> ─────────────┼──────────────────────────────────────────────────┼─────────
#>  Total       │               260                527         393 │    1180 
#> 
#> Chi-2(6) = 73.2, p <.001
#> Kendall's Tau-b = 0.20

Association measures

For a quick overview of all available association statistics, pass a contingency table to assoc_measures():

tbl <- xtabs(~ smoking + education, data = sochealth)
assoc_measures(tbl)
#> Measure                            Estimate     SE  CI lower  CI upper      p 
#> Cramer's V                            0.136     --     0.079     0.191  <.001 
#> Contingency Coefficient               0.134     --        --        --  <.001 
#> Lambda symmetric                      0.000  0.000     0.000     0.000     -- 
#> Lambda R|C                            0.000  0.000     0.000     0.000     -- 
#> Lambda C|R                            0.000  0.000     0.000     0.000     -- 
#> Goodman-Kruskal's Tau R|C             0.018  0.008     0.003     0.034   .023 
#> Goodman-Kruskal's Tau C|R             0.008  0.003     0.001     0.014   .022 
#> Uncertainty Coefficient symmetric     0.011  0.005     0.002     0.021   .021 
#> Uncertainty Coefficient R|C           0.018  0.008     0.003     0.032   .021 
#> Uncertainty Coefficient C|R           0.009  0.004     0.001     0.016   .021 
#> Goodman-Kruskal Gamma                -0.268  0.056    -0.378    -0.158  <.001 
#> Kendall's Tau-b                      -0.126  0.027    -0.180    -0.073  <.001 
#> Kendall's Tau-c                      -0.117  0.026    -0.167    -0.067  <.001 
#> Somers' D R|C                        -0.091  0.020    -0.131    -0.052  <.001 
#> Somers' D C|R                        -0.175  0.038    -0.249    -0.101  <.001

Individual functions such as cramer_v(), gamma_gk(), or kendall_tau_b() return a scalar by default. Pass detail = TRUE for the confidence interval and p-value:

cramer_v(tbl, detail = TRUE)
#> Estimate  CI lower  CI upper      p
#>    0.136     0.079     0.191  <.001

Summary tables

table_categorical() covers grouped or one-way summary tables for categorical variables:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "tinytable"
)

table_continuous() summarizes continuous variables, either overall or by a categorical by variable, and can also add group-comparison tests:

table_continuous(
  sochealth,
  select = c(bmi, life_sat_health),
  by = education
)
#> Descriptive statistics
#> 
#>  Variable                       │ Group              M     SD    Min    Max  
#> ────────────────────────────────┼────────────────────────────────────────────
#>  Body mass index                │ Lower secondary  28.09  3.47  18.20  38.90 
#>                                 │ Upper secondary  26.02  3.43  16.00  37.10 
#>                                 │ Tertiary         24.39  3.52  16.00  33.00 
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5) │ Lower secondary   2.71  1.20   1.00   5.00 
#>                                 │ Upper secondary   3.53  1.19   1.00   5.00 
#>                                 │ Tertiary          4.11  1.04   1.00   5.00 
#> 
#>  Variable                       │ Group            95% CI LL  95% CI UL   n  
#> ────────────────────────────────┼────────────────────────────────────────────
#>  Body mass index                │ Lower secondary    27.66      28.51    260 
#>                                 │ Upper secondary    25.73      26.31    534 
#>                                 │ Tertiary           24.04      24.74    394 
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5) │ Lower secondary     2.57       2.86    259 
#>                                 │ Upper secondary     3.43       3.63    534 
#>                                 │ Tertiary            4.01       4.21    399 
#> 
#>  Variable                       │ Group              p   
#> ────────────────────────────────┼────────────────────────
#>  Body mass index                │ Lower secondary  <.001 
#>                                 │ Upper secondary        
#>                                 │ Tertiary               
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  Satisfaction with health (1-5) │ Lower secondary  <.001 
#>                                 │ Upper secondary        
#>                                 │ Tertiary

table_continuous_lm() covers the same reporting territory when you want to stay in a linear-model framework, for example with robust or cluster-robust standard errors, case weights, or additive covariate adjustment:

table_continuous_lm(
  sochealth,
  select = c(wellbeing_score, bmi),
  by = sex,
  vcov = "HC3"
)
#> Continuous outcomes by Sex
#> 
#>  Variable                      │ M (Female)  M (Male)  Δ (Male - Female) 
#> ───────────────────────────────┼─────────────────────────────────────────
#>  WHO-5 wellbeing index (0-100) │   67.16      71.05          3.89        
#>  Body mass index               │   25.69      26.20          0.51        
#> 
#>  Variable                      │ 95% CI LL  95% CI UL    p     R²    n   
#> ───────────────────────────────┼─────────────────────────────────────────
#>  WHO-5 wellbeing index (0-100) │   2.12       5.65     <.001  0.02  1200 
#>  Body mass index               │   0.09       0.93      .018  0.00  1188

For detailed guidance, see the dedicated articles on table_categorical(), table_continuous(), table_continuous_lm(), and the final reporting overview for APA-style summary tables.

Row-wise summaries

mean_n(), sum_n(), and count_n() compute row-wise statistics across selected columns, with automatic handling of missing values.

sochealth |>
  dplyr::mutate(
    mean_sat  = mean_n(select = starts_with("life_sat")),
    sum_sat   = sum_n(select = starts_with("life_sat"), min_valid = 2),
    n_missing = count_n(select = starts_with("life_sat"), special = "NA")
  ) |>
  dplyr::select(starts_with("life_sat"), mean_sat, sum_sat, n_missing) |>
  head() |>
  as.data.frame()
#>   life_sat_health life_sat_work life_sat_relationships life_sat_standard
#> 1               5             3                      5                 5
#> 2               4             4                      5                 5
#> 3               3             2                      5                 3
#> 4               3             4                      3                 2
#> 5               4             5                      4                 4
#> 6               5             5                      5                 3
#>   mean_sat sum_sat n_missing
#> 1     4.50      18         0
#> 2     4.50      18         0
#> 3     3.25      13         0
#> 4     3.00      12         0
#> 5     4.25      17         0
#> 6     4.50      18         0

Learn more

See ?varlist to inspect variables, labels, values, and missing data.
See ?freq for one-way frequency tables (weights, sorting, custom missing values, labelled-data display modes).
See ?cross_tab for the full list of arguments (weights, simulation, association measures).
See ?assoc_measures for the complete list of association statistics; ?cramer_v for the canonical entry point.
See ?table_categorical for grouped or one-way categorical tables.
See ?table_continuous for continuous summaries and group comparisons.
See ?table_continuous_lm for model-based mean-comparison tables with robust / cluster-robust / bootstrap / jackknife SE, case weights, or additive covariate adjustment.
See ?mean_n, ?sum_n, ?count_n for row-wise summaries with optional minimum-valid-values rules.
See ?code_book to generate an interactive HTML codebook; ?label_from_names to derive variable labels from "code. label"-style column names (e.g., LimeSurvey exports).

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.