The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Mixed-Subjects IRT Calibration

This vignette shows the recommended mixed-subjects workflow for a unidimensional 2PL model using the marginal maximum-likelihood (MML) estimator (fit_mixed_subjects_mml). The package expects three item response matrices with no respondent IDs and the same ordering of item columns:

All three matrices must contain binary responses \((Y_{ij} \in \{0,1\})\). Probability (fractional) predictions are not accepted, as they are not a valid likelihood input for the MML IRT objective and break the correction. If your prediction model outputs probabilities, sample binary responses from them first (e.g. rbinom).

The fitted objective is

\[L_o^{\mathrm{marg}}(\gamma) + \lambda\bigl[ L_g^{\mathrm{marg}}(\gamma) - L_p^{\mathrm{marg}}(\gamma)\bigr]\]

where \(\gamma\) is a vector of item parameters and each \(L^\mathrm{marg}\) is the true IRT marginal negative log-likelihood, with posteriors recomputed from the current candidate \(\gamma\) at every gradient step. Setting \(\lambda = 0\) recovers the human-only MML calibration. See the Choosing Lambda vignette for the specific background on why the MML objective is preferred over expected-count based estimators.

Simulate example data

library(mixedsubjectsirt)
library(ggplot2)

set.seed(242424)

n_human    <- 400
n_generated <- 1200
n_items    <- 8

true_pars <- data.frame(
  item = paste0("Item", seq_len(n_items)),
  a    = seq(0.8, 1.6, length.out = n_items),
  d    = seq(-1.1, 1.1, length.out = n_items)
)
true_pars$b <- -true_pars$d / true_pars$a

theta_human <- rnorm(n_human)
observed    <- simulate_2pl(theta_human, true_pars)

# Strongly informative predictions. On the n labeled subjects, predictions
# match human responses (the F = Y benchmark from the simulation study), and
# N additional unlabeled responses are drawn from the same 2PL. This is
# the "good predictor" regime, where the method should lean on unlabeled data.
# (Further down we show the complementary case of an uninformative predictor,
# where the criterion instead drives lambda to 0.)
predicted <- observed
generated <- simulate_2pl(rnorm(n_generated), true_pars)

Step 1: Fit the human baseline

Baseline models are estimated using mirt.1

human_start <- fit_2pl(observed, technical = list(NCYCLES = 500))

Step 2: Fit the MML mixed-subjects model

fit_mixed_subjects_mml() uses a MML-based EM procedure for iterative estimation at a given \(\lambda\) value. If you have a value of \(\lambda\) already (say from a pilot study or previous ability tuning), the model can be estimated directly here.

mixed_mml <- fit_mixed_subjects_mml(
  observed  = observed,
  predicted = predicted,
  generated = generated,
  lambda    = 0.5,
  initial_pars = human_start$pars
)

mixed_mml
#> mixedsubjectsirt 2PL fit
#>   items:      8
#>   lambda:     0.5
#>   loss:       4.91044
#>   convergence: 0 
#>   estimator:  marginal MML PPI++
mixed_mml$item_pars
#>    item         a          d          b
#> 1 Item1 0.7431515 -1.0825772  1.4567382
#> 2 Item2 1.0067308 -0.7278419  0.7229757
#> 3 Item3 0.8547961 -0.4113346  0.4812078
#> 4 Item4 1.0999070 -0.1453192  0.1321195
#> 5 Item5 1.1673630  0.1875514 -0.1606624
#> 6 Item6 1.2566356  0.4923054 -0.3917646
#> 7 Item7 1.5526271  0.8477928 -0.5460376
#> 8 Item8 1.5472343  1.1508445 -0.7438075

Step 3: Select \(\lambda\) by ability-score risk

tune_lambda_ability_risk() with fit_fn = fit_mixed_subjects_mml selects the \(\lambda\) that minimizes propagated ability-score risk \(\mathbb{E}\big[g'\Sigma_\gamma g\big]\) (where \(\Sigma_\gamma\) is the Louis-corrected marginal sandwich covariance).2 By default this is done by direct 1-D optimization over [0, 1]. The final fit from this optimal \(\lambda\) is also returned as the best_fit object within the output list, so the user is not required to call fit_mixed_subjects_mml again.

ability_tuned <- tune_lambda_ability_risk(
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_start$pars,
  fit_fn       = fit_mixed_subjects_mml,
  control      = list(maxit = 200)
)

ability_tuned$best_lambda
#> [1] 0.7924548

Because the predictor is highly informative, the approach selects a \(\lambda\) near the theoretical maximum \(N/(n+N) = 1200/1600 = 0.75\). As such, the method leans heavily on the unlabeled response data. (The returned summary records each \(\lambda\) the optimizer evaluated, with selection_risk = Inf for any that failed to converge, so selection is protected against numerical failures. When the predictor is uninformative the same criterion drives \(\lambda\) to 0 instead; see below.)

Step 4: Inspect the covariance

vcov() on a scalar-lambda MML fit automatically uses vcov_mixed_subjects_mml(), which applies Louis’ observed-information correction.4 Here, the bread is \(H_\mathrm{comp} - I_\mathrm{miss}\) rather than the EM complete-data Hessian alone.

Sigma <- vcov(cf_tuned$final_fit)
dim(Sigma)  # 2J × 2J
#> [1] 16 16

Compare calibrations

human_only <- fit_mixed_subjects_mml(
  observed  = observed,
  predicted = predicted,
  generated = generated,
  lambda    = 0,
  initial_pars = human_start$pars
)

estimates <- rbind(
  data.frame(estimator = "human only",  human_only$item_pars),
  data.frame(estimator = "MML lambda = 0.5", mixed_mml$item_pars),
  data.frame(estimator = "MML ability-risk",
             ability_tuned$best_fit$item_pars)
)
estimates$true_b <- true_pars$b[match(estimates$item, true_pars$item)]
estimates$true_a <- true_pars$a[match(estimates$item, true_pars$item)]

ggplot(estimates, aes(true_b, b, colour = estimator)) +
  geom_abline(slope = 1, intercept = 0, linewidth = 0.4) +
  geom_point(size = 2) +
  labs(x = "True difficulty", y = "Estimated difficulty", colour = NULL) +
  theme_minimal()

When the LLM is uninformative

While the method is able to efficiently exploit information derived from a good predictor, it is also capable of rejecting useless information derived from a bad one. Here we simulate responses that are essentially unrelated to ability (drawn from scrambled item parameters), so the paired correction between observed and predicted carries no usable signal:

set.seed(242424)
bad_pars    <- true_pars
bad_pars$a  <- pmax(0.05, abs(rnorm(n_items, 0, 0.1)))   # near-zero slopes
bad_pars$d  <- rnorm(n_items, 0, 2)                       # difficulties unrelated to truth

predicted_bad <- simulate_2pl(theta_human, bad_pars)
generated_bad <- simulate_2pl(rnorm(n_generated), bad_pars)

tuned_bad <- tune_lambda_ability_risk(
  observed     = observed,
  predicted    = predicted_bad,
  generated    = generated_bad,
  target_resp  = observed,
  initial_pars = human_start$pars,
  fit_fn       = fit_mixed_subjects_mml,
  control      = list(maxit = 200)
)

tuned_bad$best_lambda   # expect ~0: the useless LLM is correctly ignored
#> [1] 0.03332212

The criterion drives \(\lambda \to 0\), recovering the human-only item parameters. As such, the same \(\lambda\) tuning procedure both embraces informative predictions (as in the main example, \(\lambda \to N/(n+N)\)) and rejects an uninformative predictions (\(\lambda \approx 0\)).

Validation

A full simulation study confirms the recommended workflow behaves as intended:

See the Simulation Validation vignette for the full results, and simulations/ in the source tree for the reproduction code.


  1. Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment._Journal of Statistical Software_, 48, 1-29.↩︎

  2. Liu, C. W., & Chalmers, R. P. (2021). A note on computing Louis’ observed information matrix identity for IRT and cognitive diagnostic models. British Journal of Mathematical and Statistical Psychology, 74(1), 118-138.↩︎

  3. Mani, P., Xu, P., Lipton, Z. C., & Oberst, M. (2025). No free lunch: Non-asymptotic analysis of prediction-powered inference. arXiv preprint arXiv:2505.20178.↩︎

  4. Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society Series B: Statistical Methodology, 44(2), 226-233.↩︎

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.