Mixed-Subjects IRT Calibration

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

This vignette shows the recommended mixed-subjects workflow for a unidimensional 2PL model using the marginal maximum-likelihood (MML) estimator (fit_mixed_subjects_mml). The package expects three item response matrices with no respondent IDs and the same ordering of item columns:

All three matrices must contain binary responses $(Y_{ij} \in \{0,1\})$. Probability (fractional) predictions are not accepted, as they are not a valid likelihood input for the MML IRT objective and break the correction. If your prediction model outputs probabilities, sample binary responses from them first (e.g. rbinom).

\[L_o^{\mathrm{marg}}(\gamma) + \lambda\bigl[ L_g^{\mathrm{marg}}(\gamma) - L_p^{\mathrm{marg}}(\gamma)\bigr]\]

where $\gamma$ is a vector of item parameters and each $L^\mathrm{marg}$ is the true IRT marginal negative log-likelihood, with posteriors recomputed from the current candidate $\gamma$ at every gradient step. Setting $\lambda = 0$ recovers the human-only MML calibration. See the Choosing Lambda vignette for the specific background on why the MML objective is preferred over expected-count based estimators.

Simulate example data

library(mixedsubjectsirt)
library(ggplot2)

set.seed(242424)

n_human    <- 400
n_generated <- 1200
n_items    <- 8

true_pars <- data.frame(
  item = paste0("Item", seq_len(n_items)),
  a    = seq(0.8, 1.6, length.out = n_items),
  d    = seq(-1.1, 1.1, length.out = n_items)
)
true_pars$b <- -true_pars$d / true_pars$a

theta_human <- rnorm(n_human)
observed    <- simulate_2pl(theta_human, true_pars)

# Strongly informative predictions. On the n labeled subjects, predictions
# match human responses (the F = Y benchmark from the simulation study), and
# N additional unlabeled responses are drawn from the same 2PL. This is
# the "good predictor" regime, where the method should lean on unlabeled data.
# (Further down we show the complementary case of an uninformative predictor,
# where the criterion instead drives lambda to 0.)
predicted <- observed
generated <- simulate_2pl(rnorm(n_generated), true_pars)

Step 1: Fit the human baseline

Baseline models are estimated using mirt.¹

human_start <- fit_2pl(observed, technical = list(NCYCLES = 500))

Step 2: Fit the MML mixed-subjects model

fit_mixed_subjects_mml() uses a MML-based EM procedure for iterative estimation at a given $\lambda$ value. If you have a value of $\lambda$ already (say from a pilot study or previous ability tuning), the model can be estimated directly here.

mixed_mml <- fit_mixed_subjects_mml(
  observed  = observed,
  predicted = predicted,
  generated = generated,
  lambda    = 0.5,
  initial_pars = human_start$pars
)

mixed_mml
#> mixedsubjectsirt 2PL fit
#>   items:      8
#>   lambda:     0.5
#>   loss:       4.91044
#>   convergence: 0 
#>   estimator:  marginal MML PPI++

mixed_mml$item_pars
#>    item         a          d          b
#> 1 Item1 0.7431515 -1.0825772  1.4567382
#> 2 Item2 1.0067308 -0.7278419  0.7229757
#> 3 Item3 0.8547961 -0.4113346  0.4812078
#> 4 Item4 1.0999070 -0.1453192  0.1321195
#> 5 Item5 1.1673630  0.1875514 -0.1606624
#> 6 Item6 1.2566356  0.4923054 -0.3917646
#> 7 Item7 1.5526271  0.8477928 -0.5460376
#> 8 Item8 1.5472343  1.1508445 -0.7438075

Step 3: Select $\lambda$ by ability-score risk

tune_lambda_ability_risk() with fit_fn = fit_mixed_subjects_mml selects the $\lambda$ that minimizes propagated ability-score risk $\mathbb{E}\big[g'\Sigma_\gamma g\big]$ (where $\Sigma_\gamma$ is the Louis-corrected marginal sandwich covariance).² By default this is done by direct 1-D optimization over [0, 1]. The final fit from this optimal $\lambda$ is also returned as the best_fit object within the output list, so the user is not required to call fit_mixed_subjects_mml again.

ability_tuned <- tune_lambda_ability_risk(
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_start$pars,
  fit_fn       = fit_mixed_subjects_mml,
  control      = list(maxit = 200)
)

ability_tuned$best_lambda
#> [1] 0.7924548

Because the predictor is highly informative, the approach selects a $\lambda$ near the theoretical maximum $N/(n+N) = 1200/1600 = 0.75$. As such, the method leans heavily on the unlabeled response data. (The returned summary records each $\lambda$ the optimizer evaluated, with selection_risk = Inf for any that failed to converge, so selection is protected against numerical failures. When the predictor is uninformative the same criterion drives $\lambda$ to 0 instead; see below.)

Step 3b (recommended workflow): cross-fit $\lambda$ tuning

tune_lambda_ability_risk() above estimated an appropriate $\lambda$ and fits the final model on the same data used for this estimation. Previous analysis of prediction-powered inference in finite samples shows that estimating $\lambda$ on the data you also estimate model parameters with is optimistic. Item parameters estimated this are biased in finite samples and may undercover true parameter values.³

tune_lambda_ability_risk_crossfit() removes this bias by tuning $\lambda$ on held-out data: each fold’s $\lambda$ is chosen using only out of fold item responses, and the final fit combines them. Passing fit_fn = fit_mixed_subjects_mml for the per-fold tuning and final_fit_fn = fit_mixed_subjects_mml for the final fit produces a single scalar-$\lambda$ mixed subjects calibration fit (the fold-specific $\lambda$’s are averaged, weighted by fold size).

cf_tuned <- tune_lambda_ability_risk_crossfit(
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  initial_pars = human_start$pars,
  fit_fn       = fit_mixed_subjects_mml,
  final_fit_fn = fit_mixed_subjects_mml,
  n_splits     = 2,          # the standard PPI sample split (also the default)
  control      = list(maxit = 200)
)

cf_tuned$lambda_by_split   # one tuned lambda per held-out fold
#> [1] 0.8758023 0.8896977
cf_tuned$lambda_final      # fold-size-weighted scalar used for the final fit
#> [1] 0.88275

The object cf_tuned$final_fit provides the final model fit from calibration, and vcov(cf_tuned$final_fit) gives its Louis-corrected covariance. We suggest use of the default n_splits = 2, the standard PPI sample split (one half tunes, the other estimates, then swap). You may notice the cross-fit $\lambda$ (here $\approx 0.85$) sits above the same-data value ($\approx 0.7$) and above the perfect-predictor ceiling $N/(n+N) = 0.75$. That is expected and is not evidence of a better $\lambda$: with two folds each $\lambda$ is tuned on only half the labeled subjects ($n_\text{train} = n/2 = 200$), and the PPI-optimal $\lambda$ grows as labeled data shrinks (it tracks $N/(N + n_\text{train})$, here $1200/1400 \approx 0.86$). The fold $\lambda$’s are tuned for $n_\text{train}$ but the final fit applies their average to the full sample, so they run a little higher than $\lambda$ estimated in the same sample; in operational settings where $N \gg n$, this difference shrinks to zero. Importantly, the reason to cross-fit is not the $\lambda$ value, but to remove finite-sample item parameter bias. The cheaper same-data tuner in Step 3 is fine for exploration; prefer the cross-fit estimate for operational calibrations or further research.

Step 4: Inspect the covariance

vcov() on a scalar-lambda MML fit automatically uses vcov_mixed_subjects_mml(), which applies Louis’ observed-information correction.⁴ Here, the bread is $H_\mathrm{comp} - I_\mathrm{miss}$ rather than the EM complete-data Hessian alone.

Sigma <- vcov(cf_tuned$final_fit)
dim(Sigma)  # 2J × 2J
#> [1] 16 16

Compare calibrations

human_only <- fit_mixed_subjects_mml(
  observed  = observed,
  predicted = predicted,
  generated = generated,
  lambda    = 0,
  initial_pars = human_start$pars
)

estimates <- rbind(
  data.frame(estimator = "human only",  human_only$item_pars),
  data.frame(estimator = "MML lambda = 0.5", mixed_mml$item_pars),
  data.frame(estimator = "MML ability-risk",
             ability_tuned$best_fit$item_pars)
)
estimates$true_b <- true_pars$b[match(estimates$item, true_pars$item)]
estimates$true_a <- true_pars$a[match(estimates$item, true_pars$item)]

ggplot(estimates, aes(true_b, b, colour = estimator)) +
  geom_abline(slope = 1, intercept = 0, linewidth = 0.4) +
  geom_point(size = 2) +
  labs(x = "True difficulty", y = "Estimated difficulty", colour = NULL) +
  theme_minimal()

When the LLM is uninformative

While the method is able to efficiently exploit information derived from a good predictor, it is also capable of rejecting useless information derived from a bad one. Here we simulate responses that are essentially unrelated to ability (drawn from scrambled item parameters), so the paired correction between observed and predicted carries no usable signal:

set.seed(242424)
bad_pars    <- true_pars
bad_pars$a  <- pmax(0.05, abs(rnorm(n_items, 0, 0.1)))   # near-zero slopes
bad_pars$d  <- rnorm(n_items, 0, 2)                       # difficulties unrelated to truth

predicted_bad <- simulate_2pl(theta_human, bad_pars)
generated_bad <- simulate_2pl(rnorm(n_generated), bad_pars)

tuned_bad <- tune_lambda_ability_risk(
  observed     = observed,
  predicted    = predicted_bad,
  generated    = generated_bad,
  target_resp  = observed,
  initial_pars = human_start$pars,
  fit_fn       = fit_mixed_subjects_mml,
  control      = list(maxit = 200)
)

tuned_bad$best_lambda   # expect ~0: the useless LLM is correctly ignored
#> [1] 0.03332212

The criterion drives $\lambda \to 0$, recovering the human-only item parameters. As such, the same $\lambda$ tuning procedure both embraces informative predictions (as in the main example, $\lambda \to N/(n+N)$) and rejects an uninformative predictions ($\lambda \approx 0$).

Validation

A full simulation study confirms the recommended workflow behaves as intended:

$\mathbf{\lambda}$-selection tracks predictor quality. A perfect paired predictor $(F = Y)$ selects $\lambda \approx N/(n+N)$; a useless predictor is down-weighted to $\lambda \approx 0$.
The Louis-corrected standard errors provide correct coverage. Across all simulation conditions, including a useless or biased predictor, vcov() on a scalar MML fit attains nominal Wald-interval coverage (~0.91/0.96) for both discriminations and intercepts, whereas the uncorrected EM-Hessian covariance under-covers (~0.71/0.79).
No average harm done to ability scoring. The tuned ability-score RMSE is no worse than the human-only calibration on average (every regime’s mean $\Delta \text{RMSE} \leq 0$) when the predictor is uninformative.

See the Simulation Validation vignette for the full results, and simulations/ in the source tree for the reproduction code.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Mixed-Subjects IRT Calibration

Simulate example data

Step 1: Fit the human baseline

Step 2: Fit the MML mixed-subjects model

Step 3: Select \(\lambda\) by ability-score risk

Step 3b (recommended workflow): cross-fit \(\lambda\) tuning

Step 4: Inspect the covariance

Compare calibrations

When the LLM is uninformative

Validation