Experimental. Per-item lambda uses a frozen expected-count approximation, not the full marginal-MML objective. The IRT marginal likelihood integrates the joint response pattern and does not decompose item-wise, so a theoretically principled per-item marginal objective does not yet exist in this package. Results from per-item tuning should be treated as approximate and validated against the scalar-lambda MML baseline. See the documentation of
tune_lambda_ability_risk_item() for details.
A single global \(\lambda\) is fragile: if even one item has poorly-correlated LLM predictions, the ability-risk criterion may force the global optimum to \(\lambda = 0\), preventing all items from benefiting. Per-item \(\lambda_j\) allows each item to draw on the LLM data at its own optimal level.
Consider a test with eight items where:
- Items 1–4 are straightforward factual questions, which the LLM predicts well.
- Items 5–8 require nuanced reasoning, on which the LLM is nearly random.
With a scalar \(\lambda\), the four poor items push the optimum toward 0. With per-item \(\lambda_j\), items 1–4 get \(\lambda_j \approx 0.5\)–0.7 while items 5–8 get \(\lambda_j \approx 0\).
library(mixedsubjectsirt)
library(ggplot2)
set.seed(2026)
n_human <- 400
n_generated <- 1200
n_items <- 8
n_good <- 4 # items where LLM predicts well
true_pars <- data.frame(
  item = paste0("Item", seq_len(n_items)),
  a = seq(0.8, 1.6, length.out = n_items),
  d = seq(-1.1, 1.1, length.out = n_items)
)
true_pars$b <- -true_pars$d / true_pars$a
theta_human <- rnorm(n_human)
observed <- simulate_2pl(theta_human, true_pars)
# LLM: good for items 1–4 (same DGP), poor for 5–8 (random noise)
llm_pars_good <- true_pars
llm_pars_poor <- true_pars
llm_pars_poor$a <- pmax(0.05, rnorm(n_items, 0, 0.1)) # near-random
llm_pars_poor$d <- rnorm(n_items, 0, 2)
llm_pars_poor$b <- -llm_pars_poor$d / llm_pars_poor$a
# Build predicted (same subjects as human)
predicted <- observed # F = Y for first 4 items
predicted[, 5:8] <- simulate_2pl(theta_human, llm_pars_poor)[, 5:8]
# Build generated
generated_good <- simulate_2pl(rnorm(n_generated), true_pars)
generated_poor <- simulate_2pl(rnorm(n_generated), llm_pars_poor)
generated <- cbind(generated_good[, 1:4], generated_poor[, 5:8])
colnames(generated) <- true_pars$item
The first four items have perfect paired predictions (F = Y); items 5–8 have near-random LLM predictions.
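For readers without the package at hand, the data-generating step can be approximated in base R. This is a minimal sketch, assuming a slope-intercept 2PL model in which the probability of a correct response is plogis(a * theta + d), which is the form implied by b = -d / a above. The function name simulate_2pl_sketch is hypothetical, and the package's own simulate_2pl() may differ in its details and return format.

```r
# Hypothetical stand-in for simulate_2pl(): slope-intercept 2PL,
# P(Y = 1 | theta) = plogis(a * theta + d), so difficulty is b = -d / a.
simulate_2pl_sketch <- function(theta, pars) {
  n <- length(theta)
  # Linear predictor: persons in rows, items in columns
  eta <- outer(theta, pars$a) + matrix(pars$d, n, nrow(pars), byrow = TRUE)
  # Draw one Bernoulli response per person-item cell
  resp <- matrix(rbinom(length(eta), 1, plogis(eta)), nrow = n)
  colnames(resp) <- pars$item
  resp
}

set.seed(2026)
pars <- data.frame(
  item = paste0("Item", 1:8),
  a = seq(0.8, 1.6, length.out = 8),
  d = seq(-1.1, 1.1, length.out = 8)
)
resp <- simulate_2pl_sketch(rnorm(400), pars)
round(colMeans(resp), 2)  # proportion correct per item
```

Items with larger intercepts d are easier, so the proportion correct should tend to rise from Item1 to Item8.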
human_pars <- fit_2pl(observed, technical = list(NCYCLES = 500))$pars
global_tuned <- tune_lambda_ability_risk(
  lambda_grid = seq(0, 1, by = 0.1),
  observed = observed,
  predicted = predicted,
  generated = generated,
  initial_pars = human_pars,
  fit_fn = fit_mixed_subjects_mml,
  n_quad = 11,
  control = list(maxit = 200)
)
cat("Global scalar best lambda:", global_tuned$best_lambda, "\n")
#> Global scalar best lambda: 0.3897205
The global scalar is forced to a compromise: the four poor items constrain it to a value smaller than what items 1–4 could support.
tune_lambda_ppi_score_item() applies the Proposition 2
formula independently per item using the 2×2 diagonal block of \(H^{-1}\) and the item-level sub-vectors of
the score matrices. This is fast (no fitting required) and shows which
items are well-predicted.
ppi_item <- tune_lambda_ppi_score_item(
  observed = observed,
  predicted = predicted,
  item_pars = human_pars,
  n_generated = n_generated,
  n_quad = 11
)
cat("Per-item PPI++ lambda:\n")
#> Per-item PPI++ lambda:
print(data.frame(item = ppi_item$item, lambda = round(ppi_item$lambda, 3)))
#>    item lambda
#> 1 Item1   0.75
#> 2 Item2   0.75
#> 3 Item3   0.75
#> 4 Item4   0.75
#> 5 Item5   0.00
#> 6 Item6   0.00
#> 7 Item7   0.00
#> 8 Item8   0.00
cat("N/(n+N) upper bound:", round(n_generated / (n_human + n_generated), 3), "\n")
#> N/(n+N) upper bound: 0.75
Items 1–4 (F = Y) should show \(\lambda_j \approx N/(n+N) = 0.75\); items 5–8 (random LLM) should show \(\lambda_j \approx 0\).
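The two limiting values can be reproduced with a small base-R sketch. It assumes that the per-item rule has the form of the PPI++ power-tuning ratio: the trace of \(H^{-1}\) times the symmetrized cross-covariance of the observed and predicted scores times \(H^{-1}\), divided by \(2(1 + n/N)\) times the corresponding trace for the predicted-score covariance, clipped to [0, 1]. The function name ppi_lambda_sketch and the random stand-in score matrices and Hessian block are hypothetical; they are not the package's internals.

```r
# Hypothetical helper: PPI++-style power-tuning weight for one item.
# g_obs, g_pred: n x p per-person score contributions computed from the
# observed and the predicted responses; H: p x p Hessian block.
ppi_lambda_sketch <- function(g_obs, g_pred, H, n_generated) {
  n <- nrow(g_obs)
  H_inv <- solve(H)
  cross <- cov(g_obs, g_pred)  # Cov(score on Y, score on F)
  num <- sum(diag(H_inv %*% (cross + t(cross)) %*% H_inv))
  den <- 2 * (1 + n / n_generated) *
    sum(diag(H_inv %*% cov(g_pred) %*% H_inv))
  min(1, max(0, num / den))    # clip to [0, 1]
}

set.seed(1)
n <- 400
N <- 1200
g <- matrix(rnorm(n * 2), n, 2)  # stand-in scores for (a, d)
H <- crossprod(g) / n            # stand-in 2 x 2 Hessian block

# Perfect predictions (F = Y): the ratio reduces to N / (n + N)
lambda_perfect <- ppi_lambda_sketch(g, g, H, N)

# Predictions unrelated to the observed scores: the weight is near zero
lambda_noise <- ppi_lambda_sketch(g, matrix(rnorm(n * 2), n, 2), H, N)
```

When the predicted scores equal the observed scores, the numerator and denominator traces coincide up to the factor \(1 + n/N\), which gives exactly \(N/(n+N)\) regardless of the Hessian block.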
tune_lambda_ability_risk_item() uses coordinate descent:
for each item \(j\), it finds the \(\lambda_j\) that minimizes ability-score
risk while holding all other \(\lambda_{j'}\) fixed. By default each
coordinate is solved by direct 1-D optimization
(continuous \(\lambda_j\)); pass
method = "grid" to restrict it to the values in
lambda_grid (which otherwise only bounds the search). Each
evaluation fits with the frozen expected-count
Q-function (not the full marginal-MML objective) because the
IRT marginal likelihood does not decompose item-wise. Starting from the
global scalar optimum (not from all-zeros) is essential — see the note
below.
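To make the "frozen expected-count" idea concrete, here is a minimal base-R sketch of Bock-Aitkin style expected counts for a 2PL model. It is a generic EM illustration, not the package's implementation: the function names, the equally spaced quadrature nodes on [-4, 4], and the standard normal prior are all assumptions. Once the posterior weights are computed at fixed parameters, the Q-function splits into one term per item, which is what allows a per-item \(\lambda_j\) to be evaluated separately.

```r
# Hypothetical helper: expected counts at fixed ("frozen") item parameters.
expected_counts_sketch <- function(resp, a, d, n_quad = 11) {
  nodes <- seq(-4, 4, length.out = n_quad)
  prior <- dnorm(nodes)
  prior <- prior / sum(prior)
  # P[k, j]: probability of a correct response to item j at node k
  P <- plogis(outer(nodes, a) + matrix(d, n_quad, length(a), byrow = TRUE))
  # Log-likelihood of each person's full response pattern at each node
  loglik <- resp %*% t(log(P)) + (1 - resp) %*% t(log(1 - P))
  post <- exp(loglik) * matrix(prior, nrow(resp), n_quad, byrow = TRUE)
  post <- post / rowSums(post)  # posterior weights, persons x nodes
  list(
    nodes = nodes,
    n_k = colSums(post),        # expected number of persons at each node
    r_jk = t(post) %*% resp     # expected correct counts, nodes x items
  )
}

# Item-wise piece of the Q-function, with the counts held fixed
q_item_sketch <- function(par, j, counts) {
  p <- plogis(par[1] * counts$nodes + par[2])
  r <- counts$r_jk[, j]
  sum(r * log(p) + (counts$n_k - r) * log(1 - p))
}

set.seed(1)
a <- c(0.8, 1.2, 1.6)
d <- c(-1, 0, 1)
theta <- rnorm(200)
resp <- sapply(seq_along(a), function(j) {
  rbinom(200, 1, plogis(a[j] * theta + d[j]))
})
counts <- expected_counts_sketch(resp, a, d)
q1 <- q_item_sketch(c(a[1], d[1]), j = 1, counts = counts)
```

The posterior weights depend on every item through the joint response pattern, so the decomposition holds only while the counts stay fixed. This is why the approximation is expected to degrade when the parameters move far from the values used to compute the counts.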
item_tuned <- tune_lambda_ability_risk_item(
  lambda_grid = seq(0, 1, by = 0.25),
  observed = observed,
  predicted = predicted,
  generated = generated,
  initial_pars = human_pars,
  init_lambda = global_tuned$best_lambda, # start from global best
  n_quad = 11,
  n_pass = 1,
  control = list(maxit = 200)
)
cat("Per-item ability-risk lambda:\n")
#> Per-item ability-risk lambda:
print(data.frame(item = item_tuned$item, lambda = round(item_tuned$lambda, 3)))
#>    item lambda
#> 1 Item1   0.39
#> 2 Item2   0.39
#> 3 Item3   0.39
#> 4 Item4   0.39
#> 5 Item5   0.00
#> 6 Item6   0.00
#> 7 Item7   0.00
#> 8 Item8   0.00
Items 1–4 should receive positive \(\lambda_j\) (good predictor); items 5–8 should be near zero (poor predictor).
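The coordinate-descent loop itself can be sketched independently of the IRT model. The example below is a generic illustration, assuming a user-supplied risk function of the whole \(\lambda\) vector and using base R's optimize() for each 1-D search. The function name coord_descent_sketch and the toy quadratic risk are hypothetical; in the package, each risk evaluation involves a frozen-Q fit instead.

```r
# Hypothetical sketch: coordinate descent over a weight vector in [0, 1].
coord_descent_sketch <- function(risk_fn, init, lower = 0, upper = 1,
                                 n_pass = 1) {
  lambda <- init
  for (pass in seq_len(n_pass)) {
    for (j in seq_along(lambda)) {
      # Risk as a function of coordinate j, all other coordinates fixed
      risk_j <- function(x) {
        candidate <- lambda
        candidate[j] <- x
        risk_fn(candidate)
      }
      lambda[j] <- optimize(risk_j, interval = c(lower, upper))$minimum
    }
  }
  lambda
}

# Toy risk with a known minimizer: two "good" and two "poor" items
target <- c(0.7, 0.7, 0, 0)
toy_risk <- function(lambda) sum((lambda - target)^2)

# Start every coordinate from a common scalar value
lambda_hat <- coord_descent_sketch(toy_risk, init = rep(0.4, 4))
round(lambda_hat, 3)
```

Because the toy risk is separable, one pass is enough here. With a non-separable criterion such as ability-score risk, the result after a single pass depends on the starting vector.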
fit_scalar <- global_tuned$best_fit
fit_per_item <- item_tuned$final_fit
rmse <- function(x, y) sqrt(mean((x - y)^2))
comparison <- data.frame(
  item = true_pars$item,
  true_a = round(true_pars$a, 3),
  human_a = round(human_pars$a, 3),
  scalar_a = round(fit_scalar$item_pars$a, 3),
  item_a = if (is.null(fit_per_item)) NA_real_ else
    round(fit_per_item$item_pars$a, 3)
)
knitr::kable(comparison, row.names = FALSE,
             caption = "Discrimination recovery: scalar lambda vs. per-item lambda")
| item | true_a | human_a | scalar_a | item_a |
|---|---|---|---|---|
| Item1 | 0.800 | 0.727 | 0.751 | 0.715 |
| Item2 | 0.914 | 1.256 | 1.161 | 1.109 |
| Item3 | 1.029 | 1.117 | 1.125 | 1.026 |
| Item4 | 1.143 | 1.091 | 1.040 | 1.038 |
| Item5 | 1.257 | 1.758 | 1.583 | 1.766 |
| Item6 | 1.371 | 1.610 | 1.526 | 1.618 |
| Item7 | 1.486 | 1.369 | 1.457 | 1.371 |
| Item8 | 1.600 | 1.925 | 2.227 | 1.921 |
cat("RMSE(a) human-only: ",
    round(rmse(human_pars$a, true_pars$a), 4), "\n")
#> RMSE(a) human-only: 0.2645
cat("RMSE(a) scalar MML: ",
    round(rmse(fit_scalar$item_pars$a, true_pars$a), 4), "\n")
#> RMSE(a) scalar MML: 0.2755
if (!is.null(fit_per_item)) {
  cat("RMSE(a) per-item MML: ",
      round(rmse(fit_per_item$item_pars$a, true_pars$a), 4), "\n")
}
#> RMSE(a) per-item MML: 0.2479
Starting coordinate descent from all-zeros is not recommended. When all other items are at \(\lambda_j = 0\), each single-item improvement is diluted across the full ability-risk criterion, making improvements hard to detect. The recommended workflow is:
1. Tune the global scalar \(\lambda\) with tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects_mml).
2. Pass the resulting best lambda as init_lambda to tune_lambda_ability_risk_item().
The per-item lambda coordinate descent uses the frozen
Q-function (not the full marginal-MML objective) for each
candidate evaluation. This is necessary because the IRT marginal
likelihood integrates the joint response pattern and does not decompose
item-wise. The approximation is good when initial_pars is
close to the converged parameters. For final reporting, always:
- Use vcov(item_tuned$final_fit) for uncertainty (which applies vcov_mixed_subjects with the vector-lambda bread and meat scaling).
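Turning a covariance matrix into standard errors and Wald intervals follows the usual base-R pattern. The sketch below uses a logistic regression on a built-in data set as a stand-in model, because the layout of the matrix returned by vcov(item_tuned$final_fit) is not shown here. Whether that matrix can be used in exactly this way is an assumption.

```r
# Stand-in model: logistic regression on a built-in data set
fit <- glm(am ~ wt, data = mtcars, family = binomial())

est <- coef(fit)
se <- sqrt(diag(vcov(fit)))  # standard errors from the covariance diagonal

# 95% Wald intervals
ci <- cbind(
  estimate = est,
  lower = est - qnorm(0.975) * se,
  upper = est + qnorm(0.975) * se
)
ci
```

For the per-item fit, the same steps would start from vcov(item_tuned$final_fit) in place of vcov(fit), provided the parameter estimates are extracted in the same order as the rows of that matrix.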