This vignette is a decision guide for choosing and checking weights
in cat2cat().
Read it when you want to answer one of these questions:
If you only need the basic two-period workflow, go back to Get Started. If you need multi-period, panel, aggregated, or regression workflows, continue to Advanced Workflows.
library(cat2cat)
library(dplyr)
library(tidyr)
library(e1071)
library(randomForest)
data(occup, package = "cat2cat")
data(trans, package = "cat2cat")
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]

cat2cat offers several ways to assign probability
weights to replicated observations. Each method encodes a different
distributional assumption about how ambiguous
observations split across candidate categories. When a downstream
estimand depends on the mapped category, this is the identifying
assumption for that estimand - so always check sensitivity.
Naive weights (wei_naive_c2c) are
always computed. Each replicated observation gets uniform probability
\(1/k\) where \(k\) is the number of candidate
categories.
Frequency-based weights (wei_freq_c2c)
are the default. They use category counts from the base period.
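As a minimal illustration of the two baseline schemes, the base-R sketch below computes both kinds of weights for a single ambiguous observation. The candidate codes and counts are invented for the example, and the code mirrors the description above rather than the package internals.

```r
# Hypothetical base-period counts for the k = 3 candidate categories
# that one old-period observation could map to (illustrative numbers).
base_counts <- c(A = 30, B = 15, C = 5)
k <- length(base_counts)

# Naive weights: uniform 1/k over the candidates.
wei_naive <- setNames(rep(1 / k, k), names(base_counts))

# Frequency weights: proportional to the base-period counts.
wei_freq <- base_counts / sum(base_counts)

# Both sets of weights sum to 1 across the replicated rows.
rbind(naive = wei_naive, freq = wei_freq)
```

The frequency weights concentrate mass on the common category, which is why they usually sit closer to the observed distribution than the uniform split.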
ML weights (wei_knn_c2c,
wei_lda_c2c, wei_rf_c2c,
wei_nb_c2c) use individual features to predict category
membership.
Available ML methods (check them with cat2cat_ml_run() before relying on them):
- knn: k-nearest neighbours; tune k.
- rf: random forest; ntree tuning.
- lda: linear discriminant analysis.
- nb: naive Bayes via e1071. Fast, useful after numeric/logical/factor preprocessing. Assumes conditional independence of features.

ML features must be numeric, logical, or factor columns. Factor columns are one-hot encoded automatically using levels observed in the training data and the target period. Character columns are not encoded automatically; convert them to factors first if they represent categories.
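A short base-R sketch of the character-to-factor conversion mentioned above. The data frame and the `region` column are hypothetical and not part of the `occup` dataset.

```r
# Toy data frame; `region` is a hypothetical character feature.
df <- data.frame(
  age = c(25, 40, 31),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor before using it as an ML feature.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```

Apply the same conversion to both the training data and the target period so that the factor levels line up.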
You can run multiple methods at once and compare or combine them:
occup_2_mix <- cat2cat(
data = list(
old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"
),
mappings = list(trans = trans, direction = "backward"),
ml = list(
data = occup_2010,
cat_var = "code",
method = c("knn", "rf", "lda", "nb"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50),
on_fail = "na"
)
)

Correlations between weight methods:
occup_2_mix$old %>%
select(wei_knn_c2c, wei_rf_c2c, wei_lda_c2c, wei_nb_c2c, wei_freq_c2c, wei_naive_c2c) %>%
cor(use = "pairwise.complete.obs")
#> wei_knn_c2c wei_rf_c2c wei_lda_c2c wei_nb_c2c wei_freq_c2c
#> wei_knn_c2c 1.0000000 0.8655665 0.8327864 0.6196138 0.8989887
#> wei_rf_c2c 0.8655665 1.0000000 0.8825387 0.6519528 0.8755385
#> wei_lda_c2c 0.8327864 0.8825387 1.0000000 0.6592195 0.8667809
#> wei_nb_c2c 0.6196138 0.6519528 0.6592195 1.0000000 0.6107475
#> wei_freq_c2c 0.8989887 0.8755385 0.8667809 0.6107475 1.0000000
#> wei_naive_c2c 0.4908619 0.4754159 0.4811839 0.5594270 0.5449029
#> wei_naive_c2c
#> wei_knn_c2c 0.4908619
#> wei_rf_c2c 0.4754159
#> wei_lda_c2c 0.4811839
#> wei_nb_c2c 0.5594270
#> wei_freq_c2c 0.5449029
#> wei_naive_c2c 1.0000000

on_fail and fail_warn

Sometimes ML probabilities cannot be produced for a subset of
replicated rows (for example, incomplete target features or
method-specific prediction failures). cat2cat() exposes
explicit policy controls in ml:
- on_fail = "freq" (default): failed ML rows are filled with wei_freq_c2c.
- on_fail = "naive": failed ML rows are filled with wei_naive_c2c.
- on_fail = "na": failed ML rows are kept as NA.
- on_fail = "error": stop immediately when failed rows are detected.
- fail_warn = TRUE (default): warn with affected rows/observations per method.
- fail_warn = FALSE: suppress these warnings.

Important: this failure accounting is specific to
cat2cat() and the constructed weight columns
(wei_*_c2c). It is different from
cat2cat_ml_run() “SKIPPED GROUPS”, which reports mapping
groups that were not evaluated in holdout diagnostics (single category,
too few observations, or method fit/predict error for that group).
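The four on_fail policies can be pictured with a small base-R sketch. The helper `fill_failed()` is hypothetical and is not the package's implementation; it only mirrors the documented row-level behaviour and ignores any renormalisation within an observation.

```r
# Hypothetical helper illustrating the row-level fallback policies.
fill_failed <- function(wei_ml, wei_freq, wei_naive,
                        on_fail = c("freq", "naive", "na", "error")) {
  on_fail <- match.arg(on_fail)
  failed <- is.na(wei_ml)  # rows where no ML probability was produced
  if (on_fail == "error" && any(failed)) {
    stop(sum(failed), " replicated rows have no ML weight")
  }
  switch(on_fail,
    freq  = ifelse(failed, wei_freq, wei_ml),
    naive = ifelse(failed, wei_naive, wei_ml),
    na    = wei_ml,
    error = wei_ml
  )
}

# Illustrative weights for three replicated rows; the second ML weight failed.
wei_ml    <- c(0.7, NA, 0.3)
wei_freq  <- c(0.6, 0.5, 0.4)
wei_naive <- c(0.5, 0.5, 0.5)

fill_failed(wei_ml, wei_freq, wei_naive, on_fail = "freq")
fill_failed(wei_ml, wei_freq, wei_naive, on_fail = "na")
```

Keeping failures as NA is the easiest way to see how many rows are affected before choosing a fallback.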
ml_setup <- list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = c("knn", "rf", "lda"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50),
on_fail = "freq", # default policy
fail_warn = TRUE # default reporting
)
# strict mode for QA pipelines
ml_strict <- ml_setup
ml_strict$on_fail <- "error"
# diagnostic mode to inspect failures directly
ml_diag <- ml_setup
ml_diag$on_fail <- "na"
ml_diag$fail_warn <- FALSE

Ensemble weights with cross_c2c() and pruning with
prune_c2c():
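To make the ensemble idea concrete, here is a base-R sketch of a weighted average of two weight columns for the candidate rows of one observation. The numbers are illustrative, and the code shows the arithmetic only, not how cross_c2c() is implemented.

```r
# Two weight columns for the three candidate rows of one observation
# (illustrative numbers, each summing to 1).
wei_freq <- c(0.6, 0.3, 0.1)
wei_knn  <- c(0.8, 0.2, 0.0)

# Mixing proportions for a 3:1 blend in favour of the frequency weights.
mix <- c(3, 1) / 4

# Row-wise weighted average of the two columns.
wei_cross <- mix[1] * wei_freq + mix[2] * wei_knn
wei_cross

# The blend is still a probability distribution over the candidates.
sum(wei_cross)
```

Because the mixing proportions sum to 1, the blended weights sum to 1 whenever each input column does.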
Different weight methods affect regression coefficients when you filter to a specific occupation group and combine both periods. This is the proper sensitivity analysis: subjects from the base period (new, no replication) plus subjects from the target period (old, weighted by probability of belonging to this group).
Run backward mapping with all ML methods:
result_all <- cat2cat(
data = list(old = occup_2008, new = occup_2010,
cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = list(
data = occup_2010, cat_var = "code",
method = c("knn", "rf", "lda", "nb"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50)
)
)

Weighted counts per group - compare how weight methods redistribute observations:
weight_cols <- c("wei_naive_c2c", "wei_freq_c2c", "wei_knn_c2c", "wei_rf_c2c", "wei_lda_c2c", "wei_nb_c2c")
# Pick groups with high replication
top_groups <- result_all$old %>%
filter(rep_c2c > 1) %>%
count(g_new_c2c, sort = TRUE) %>%
head(6) %>%
pull(g_new_c2c)
# Weighted counts from OLD period (replicated)
old_counts <- lapply(weight_cols, function(wcol) {
result_all$old %>%
filter(g_new_c2c %in% top_groups) %>%
group_by(g_new_c2c) %>%
summarise(n = sum(.data[[wcol]]), .groups = "drop")
}) %>%
setNames(gsub("wei_|_c2c", "", weight_cols)) %>%
bind_rows(.id = "method") %>%
tidyr::pivot_wider(names_from = method, values_from = n)
# Counts from NEW period (no replication, exact)
new_counts <- result_all$new %>%
filter(code %in% top_groups) %>%
count(code, name = "new_period") %>%
rename(g_new_c2c = code)
# Combine for comparison
left_join(old_counts, new_counts, by = "g_new_c2c")
#> # A tibble: 6 × 8
#> g_new_c2c naive freq knn rf lda nb new_period
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 232002 23.1 21.9 29.2 21.4 29.7 21.9 30
#> 2 232003 23.1 19.7 23.9 19.3 25.1 19.7 27
#> 3 232004 23.1 5.10 5 4.66 8.05 5.10 7
#> 4 232005 23.1 3.65 5.1 4.4 7.49 3.65 5
#> 5 232006 23.1 16.8 15.2 14.0 18.5 16.8 23
#> 6 232007 23.1 2.92 2.5 1.9 3.74 2.92 4

The new_period column shows the actual counts in 2010.
The other columns show how the 2008 observations are redistributed under
each weight method. naive assigns uniform probability (\(1/k\) for \(k\)
candidates), freq uses base period frequencies, and ML
methods (knn, rf, lda,
nb) use predicted probabilities.
Pick a specific group for regression analysis:
# New-period counts per category (no replication, so plain tally)
new_counts_all <- result_all$new %>%
count(code, name = "n_new") %>%
rename(g_new_c2c = code)
# Old-period weighted counts, joined to new-period counts
group_sizes <- result_all$old %>%
group_by(g_new_c2c) %>%
summarise(n_old = sum(wei_freq_c2c), .groups = "drop") %>%
left_join(new_counts_all, by = "g_new_c2c") %>%
filter(n_old >= 10, n_new >= 10) %>%
arrange(desc(n_old))
# Pick a group for regression analysis
target_group <- group_sizes$g_new_c2c[1]
cat("Analysing occupation group:", target_group, "\n")
#> Analysing occupation group: 222101

Regression within a single occupation group - combine both periods and compare coefficients:
# Subset old period to target group (with weights)
old_subset <- result_all$old %>%
filter(g_new_c2c == target_group)
# Subset new period to target group (no replication, weight = 1)
new_subset <- result_all$new %>%
filter(code == target_group) %>%
mutate(
wei_naive_c2c = 1, wei_freq_c2c = 1, wei_knn_c2c = 1,
wei_rf_c2c = 1, wei_lda_c2c = 1, wei_nb_c2c = 1
)
# Combine both periods
d <- bind_rows(old_subset, new_subset)
# Compare all regression coefficients across weight methods
f <- I(log(salary)) ~ age + sex + factor(edu) + exp + parttime
coefs <- sapply(weight_cols, function(wcol) {
d$w <- d$multiplier * d[[wcol]]
coef(lm(f, data = d, weights = w))
})
colnames(coefs) <- gsub("wei_|_c2c", "", weight_cols)
round(coefs, 4)
#> naive freq knn rf lda nb
#> (Intercept) 9.3225 9.2194 9.2118 9.2085 9.2229 9.2194
#> age -0.0090 -0.0083 -0.0081 -0.0079 -0.0085 -0.0083
#> sexTRUE -0.0042 -0.1153 -0.1421 -0.1323 -0.1253 -0.1153
#> factor(edu)2 -0.1317 -0.1131 -0.1157 -0.1119 -0.1108 -0.1131
#> factor(edu)3 -0.1036 -0.1065 -0.1090 -0.1055 -0.1033 -0.1065
#> factor(edu)4 -0.1333 -0.1450 -0.1472 -0.1451 -0.1428 -0.1450
#> factor(edu)5 -0.1884 -0.1370 -0.1326 -0.1396 -0.1303 -0.1370
#> exp 0.0138 0.0131 0.0128 0.0128 0.0132 0.0131
#> parttime 1.3797 1.4348 1.4411 1.4311 1.4343 1.4348

All coefficients can vary because weight methods change which old-period subjects contribute to this occupation group.
Note: Pruning discards probability information and should be used only after analysis with full weights. Prefer
prune_c2c(method = "nonzero") to remove impossible candidates while preserving the probability distribution. More aggressive pruning (highest1) is appropriate only for descriptive tables or when you need exactly one category per observation.
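The difference between the two pruning styles can be sketched in base R on a toy replicated table. This is an illustration under stated assumptions, not the prune_c2c() implementation: the `index_c2c` subject identifier, the tie-breaking rule, and the reset of the kept weight to 1 are all assumptions made for the example.

```r
# Toy replicated data: two subjects, each with candidate categories and weights.
rep_df <- data.frame(
  index_c2c = c(1, 1, 1, 2, 2),
  g_new_c2c = c("A", "B", "C", "A", "B"),
  wei = c(0.6, 0.4, 0.0, 0.5, 0.5)
)

# "nonzero"-style pruning: drop impossible candidates, keep the distribution.
nonzero <- rep_df[rep_df$wei > 0, ]

# "highest1"-style pruning: keep exactly one top candidate per subject
# (ties broken by first occurrence here) and give it weight 1.
top_rows <- unlist(lapply(
  split(seq_len(nrow(rep_df)), rep_df$index_c2c),
  function(rows) rows[which.max(rep_df$wei[rows])]
))
highest1 <- rep_df[top_rows, ]
highest1$wei <- 1

nonzero
highest1
```

Subject 2 shows what aggressive pruning discards: a 50/50 split becomes a single hard assignment.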
# Compare regression coefficients under different pruning strategies
prune_methods <- c("nonzero", "highest", "highest1")
prune_coefs <- sapply(prune_methods, function(pm) {
old_pruned <- result_all$old %>%
prune_c2c(method = pm) %>%
filter(g_new_c2c == target_group)
d <- bind_rows(old_pruned, new_subset)
d$w <- d$multiplier * d$wei_freq_c2c
coef(lm(f, data = d, weights = w))
})
round(prune_coefs, 4)
#> nonzero highest highest1
#> (Intercept) 9.2194 9.2143 9.2143
#> age -0.0083 -0.0083 -0.0083
#> sexTRUE -0.1153 -0.1200 -0.1200
#> factor(edu)2 -0.1131 -0.1122 -0.1122
#> factor(edu)3 -0.1065 -0.1068 -0.1068
#> factor(edu)4 -0.1450 -0.1454 -0.1454
#> factor(edu)5 -0.1370 -0.1337 -0.1337
#> exp 0.0131 0.0131 0.0131
#> parttime 1.4348 1.4384 1.4384

cross_c2c() creates a weighted average of multiple
weight columns. Vary the mix:
configs <- list(
equal = c(1, 1) / 2,
freq_heavy = c(3, 1) / 4,
ml_heavy = c(1, 3) / 4
)
ens_coefs <- sapply(names(configs), function(nm) {
old_ens <- result_all$old %>%
cross_c2c(c("wei_freq_c2c", "wei_knn_c2c"), configs[[nm]]) %>%
filter(g_new_c2c == target_group)
new_ens <- new_subset %>% mutate(wei_cross_c2c = 1)
d <- bind_rows(old_ens, new_ens)
d$w <- d$multiplier * d$wei_cross_c2c
coef(lm(f, data = d, weights = w))
})
round(ens_coefs, 4)
#> equal freq_heavy ml_heavy
#> (Intercept) 9.2155 9.2175 9.2136
#> age -0.0082 -0.0083 -0.0082
#> sexTRUE -0.1287 -0.1220 -0.1354
#> factor(edu)2 -0.1144 -0.1138 -0.1151
#> factor(edu)3 -0.1078 -0.1072 -0.1084
#> factor(edu)4 -0.1462 -0.1456 -0.1467
#> factor(edu)5 -0.1348 -0.1359 -0.1337
#> exp 0.0130 0.0131 0.0129
#> parttime 1.4381 1.4365 1.4396

When regression coefficients are stable across weight methods, pruning strategies, and ensemble compositions, report with confidence. When they diverge, the mapping introduces uncertainty - report the range or investigate the source.
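One simple way to report the range is to summarise each coefficient across methods. The sketch below reuses two rows of the weight-method comparison printed earlier in this section; the summary layout itself is only a suggestion.

```r
# Coefficients copied from the weight-method comparison above (two rows only).
coefs <- rbind(
  sexTRUE = c(naive = -0.0042, freq = -0.1153, knn = -0.1421,
              rf = -0.1323, lda = -0.1253, nb = -0.1153),
  exp     = c(naive = 0.0138, freq = 0.0131, knn = 0.0128,
              rf = 0.0128, lda = 0.0132, nb = 0.0131)
)

# Range and spread of each coefficient across weight methods.
sensitivity <- data.frame(
  min = apply(coefs, 1, min),
  max = apply(coefs, 1, max)
)
sensitivity$spread <- sensitivity$max - sensitivity$min
sensitivity
```

Here the exp coefficient barely moves, while sexTRUE depends heavily on whether naive weights are used, so the latter is the one to flag in a report.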
The ml argument in cat2cat() adds ML-based
probability weights, but ML is not guaranteed to improve over simpler
baselines. cat2cat_ml_run() provides per-group holdout
(single train/test split) diagnostics to answer this question
before committing to a method.
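What cat2cat_ml_run() is doing

For each mapping group (set of candidate categories linked by the
transition table) cat2cat_ml_run():
- takes the rows of ml$data whose category belongs to the group;
- splits them into train (1 - test_prop) and test (test_prop) sets;
- fits each requested method on the training set and scores it on the test set against the naive and freq baselines.

Groups with fewer than 5 observations or only one candidate category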
are skipped. Also note that cat2cat_ml_run() does not use
on_fail; it is a diagnostic tool and reports skipped groups
instead of applying row-level fallback weights.
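The per-group holdout idea can be sketched in base R for a single toy mapping group. The data, the `test_prop` value, and the baseline calculations are illustrative assumptions, not the internals of cat2cat_ml_run().

```r
set.seed(1)  # the split is random, so fix the seed for reproducibility

# Toy mapping group with three candidate categories (illustrative data).
candidates <- c("A", "B", "C")
group_df <- data.frame(
  code = sample(candidates, size = 40, replace = TRUE,
                prob = c(0.6, 0.3, 0.1))
)

test_prop <- 0.2  # share of rows held out; the value is an assumption here
n <- nrow(group_df)
test_idx <- sample(n, size = round(test_prop * n))
train <- group_df[-test_idx, , drop = FALSE]
test  <- group_df[test_idx, , drop = FALSE]

# Baselines evaluated on the held-out rows.
acc_naive <- 1 / length(candidates)               # random guess among k candidates
majority  <- names(which.max(table(train$code)))  # most common training category
acc_freq  <- mean(test$code == majority)

c(naive = acc_naive, freq = acc_freq)
```

With only a handful of held-out rows per group, these estimates are noisy, which is why very small groups are skipped.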
cv_knn <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "backward"),
ml = list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = "knn",
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10)
)
)
print(cv_knn)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.1805
#> freq (most common): 0.5324
#> knn: accuracy = 0.5228
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.4097
#> freq: 0.3026
#> knn: brier = 0.3249
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.1805
#> freq: 0.4191
#> knn: mean P(true) = 0.4492
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#> knn > naive: 89.4%
#> knn > freq: 24.9%
#>
#> SKIPPED GROUPS (single category or <5 observations):
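#> knn: 32.6%

The print() summary reports:
- ACCURACY: naive (1/k) is the random-guess baseline, freq is the majority-class baseline, and each ML line reports top-class accuracy for that method.
- BRIER SCORE: lower is better. It scores the whole predicted probability vector, which matters because cat2cat ultimately uses probability weights, not just hard classifications.
- MEAN P(TRUE CLASS): the average probability assigned to the true category. It is the metric closest to how the weights are used in cat2cat, because it measures the quality of the probability weights themselves.
- ACCURACY: ML vs BASELINES: the percent of groups where the ML method beats naive or beats freq on accuracy. This is a win-rate summary, not an average accuracy gap.

So for output like:
- knn > naive: 87.7%
- knn > freq: 18.0%
- knn: accuracy = 0.5108 vs freq (most common): 0.5366

the right reading is: kNN clearly beats the naive baseline, but it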
does not beat the frequency baseline on top-class
accuracy overall. In that case, wei_freq_c2c remains the
default choice if your only goal is classification accuracy.
At the same time, if kNN has a slightly lower Brier score and a
higher mean P(true class) than freq, then it may still be
producing better-calibrated probability weights even though its top
prediction is less often correct. That distinction matters in
cat2cat, because the mapped weights are probabilities
distributed across candidate categories rather than single-class
assignments.
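The three metrics can be computed by hand on a small example to see how they differ. The probabilities below are invented, and halving the multi-class Brier score so that it lies in [0, 1] is an assumption chosen to match the range stated in the printed summary.

```r
# Predicted probabilities for four held-out rows over three candidate
# categories (illustrative numbers), plus the true category of each row.
probs <- rbind(
  c(A = 0.5, B = 0.3, C = 0.2),
  c(A = 0.4, B = 0.5, C = 0.1),
  c(A = 0.2, B = 0.2, C = 0.6),
  c(A = 0.6, B = 0.3, C = 0.1)
)
truth <- c("A", "A", "C", "B")

# One-hot encoding of the true categories (rows = observations).
onehot <- t(vapply(truth, function(x) as.numeric(colnames(probs) == x),
                   numeric(ncol(probs))))

# Top-class accuracy: is the most probable category the true one?
pred <- colnames(probs)[max.col(probs, ties.method = "first")]
accuracy <- mean(pred == truth)

# Brier score, halved so that it lies in [0, 1] (assumed normalisation).
brier <- mean(rowSums((probs - onehot)^2) / 2)

# Mean probability assigned to the true category.
true_col <- match(truth, colnames(probs))
mean_p_true <- mean(probs[cbind(seq_along(truth), true_col)])

c(accuracy = accuracy, brier = brier, mean_p_true = mean_p_true)
```

Row 2 is a miss on accuracy but still assigns 0.4 to the true category, which is the kind of partial credit that only the probability-based metrics capture.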
cv_all <- cat2cat_ml_run(
mappings = list(trans = trans, direction = "backward"),
ml = list(
data = bind_rows(occup_2010, occup_2012),
cat_var = "code",
method = c("knn", "lda", "rf", "nb"),
features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50)
)
)
print(cv_all)
#> === cat2cat ML Cross-Validation Results ===
#>
#> ACCURACY (higher is better):
#> naive (1/k): 0.1805
#> freq (most common): 0.5402
#> knn: accuracy = 0.5392
#> lda: accuracy = 0.5453
#> rf: accuracy = 0.5428
#> nb: accuracy = 0.3977
#>
#> BRIER SCORE (lower is better, range 0-1):
#> naive: 0.4097
#> freq: 0.2923
#> knn: brier = 0.3105
#> lda: brier = 0.3250
#> rf: brier = 0.3030
#> nb: brier = 0.4542
#>
#> MEAN P(TRUE CLASS) (higher is better):
#> naive: 0.1805
#> freq: 0.4268
#> knn: mean P(true) = 0.4578
#> lda: mean P(true) = 0.4784
#> rf: mean P(true) = 0.4709
#> nb: mean P(true) = 0.4149
#>
#> ACCURACY: ML vs BASELINES (percent of groups where ML wins):
#> knn > naive: 90.8%
#> lda > naive: 91.5%
#> rf > naive: 90.8%
#> nb > naive: 78.8%
#> knn > freq: 21.8%
#> lda > freq: 34.1%
#> rf > freq: 29.2%
#> nb > freq: 17.8%
#>
#> SKIPPED GROUPS (single category or <5 observations):
#> knn: 33.6%
#> lda: 46.3%
#> rf: 33.8%
#> nb: 34.1%

Interpretation tip for mixed outputs:
- You can see no failed rows in cat2cat() but a non-zero skipped-group rate in cat2cat_ml_run(), because the two report different things.

The returned object is a named list. Each element corresponds to one mapping group:
# Pick a group with multiple candidates
group_names <- names(cv_all)
example_group <- group_names[
which(vapply(cv_all, function(g) !is.na(g$freq) && g$naive < 1, logical(1)))[1]
]
cv_all[[example_group]]
#> $naive
#> [1] 0.3333333
#>
#> $acc
#> knn lda rf nb
#> 1 NA 1 1
#>
#> $freq
#> [1] 1
#>
#> $brier
#> knn lda rf nb
#> 0.0000 NA 0.0025 NA
#>
#> $mean_prob
#> knn lda rf nb
#> 1.000 NA 0.975 NA
#>
#> $naive_brier
#> [1] 0.3333333
#>
#> $naive_mean_prob
#> [1] 0.3333333
#>
#> $freq_brier
#> [1] 0.00390625
#>
#> $freq_mean_prob
#> [1] 0.9375

Each group entry contains the group-level diagnostics behind the printed summary:
- $naive - \(1/k\) random-guess accuracy for that group.
- $freq - majority-class accuracy for that group.
- $acc - named numeric vector with ML accuracy by method.
- $naive_brier and $freq_brier - baseline Brier scores.
- $brier - named numeric vector with ML Brier scores by method.
- $naive_mean_prob and $freq_mean_prob - baseline mean P(true class).
- $mean_prob - named numeric vector with ML mean P(true class) by method.

Understanding model performance in context: This is multi-class classification; each mapping group can have 3-10+ candidate categories. A naive random guess yields only ~18% accuracy (\(1/k\), where \(k\) is the number of candidates). Achieving 50%+ is a substantial improvement over random, so do not compare these numbers to binary classification benchmarks where 80%+ is typical. The key question is whether ML beats the frequency baseline, not whether it reaches some absolute threshold.
| Scenario | Recommendation |
|---|---|
| ML model performance >> freq across most groups | ML weights add genuine signal; use them |
| ML model performance \(\approx\) freq | ML is no better than frequency; prefer wei_freq_c2c (simpler, faster) |
| ML model performance < freq for many groups | ML is adding noise; do not use ML weights |
| High skipped-group rate (>20%) | Features may have too many missing values, groups are too small, or method fitting is unstable |
Because the train/test split is random, results vary between runs.
For more stable estimates, pool more data into ml$data
(e.g. multiple survey waves) or run cat2cat_ml_run()
several times and average the summaries.
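Averaging over repeated runs can look like the base-R sketch below. `one_run()` is a hypothetical stand-in that simulates a noisy accuracy estimate; in practice it would wrap a call to cat2cat_ml_run() and extract the summary of interest.

```r
set.seed(42)

# Stand-in for one diagnostic run: returns a noisy accuracy estimate.
# Hypothetical; replace the body with a summary taken from cat2cat_ml_run().
one_run <- function() {
  correct <- rbinom(50, size = 1, prob = 0.55)  # 1 = correct prediction
  mean(correct)
}

# Repeat the random split several times and summarise the results.
runs <- replicate(20, one_run())
c(mean = mean(runs), sd = sd(runs))
```

The standard deviation across runs gives a rough sense of how much of a difference between methods is split noise.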
Caveat: high cat2cat_ml_run() model performance means the model discriminates well within mapping groups. It does not validate the mapping table itself. A perfect model with a wrong transition table will still produce wrong results.