---
title: "Choosing Lambda in Mixed-Subjects IRT"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choosing Lambda in Mixed-Subjects IRT}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

The mixed-subjects estimator calibrates a 2PL IRT model by minimizing a PPI++-style combined loss over observed human responses, paired predicted responses, and additional generated responses. Predicted and generated responses will, in practice, typically come from an LLM.

This vignette explains how to choose $\lambda$ and which functions to use. The
short answer:

| Task | Function |
|------|----------|
| **Recommended workflow (cross-fit)** | **`tune_lambda_ability_risk_crossfit()`** |
| Exploratory tuning | `tune_lambda_ability_risk()` |
| Fast approximation (frozen EC) | `tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects)` |
| Theoretical diagnostic | `tune_lambda_ppi_score()` |
| Per-item tuning (experimental) | `tune_lambda_ability_risk_item()` |


All of these now use the marginal-MML estimator ([`fit_mixed_subjects_mml()`])
by default; the frozen expected-count estimator is available via
`fit_fn = fit_mixed_subjects` but is discouraged (see below).

## Two objectives, two estimators

**Why there are two lambda objectives:**

- `tune_lambda_ppi_score()` minimizes $\text{Tr}(\Sigma_\gamma)$. This is the trace of
  the item-parameter covariance matrix and the tuning rule from the original PPI++ implementation.[^1] This is most useful as a tuning target when doing inference on estimated parameters (such as in causal inference applications).[^2] Here, this is merely implemented as a theoretical diagnostic.

- `tune_lambda_ability_risk()` minimizes $E[g'\Sigma_\gamma g]$. This is the propagated
  ability-score risk. This is the preferred practical criterion for IRT applications and optimizes the choice of $\lambda$ to minimize measurement error in downstream ability estimation. Importantly, the PPI++ objective only cares about the trace of the $\Sigma_\gamma$, while this method uses information about scale structure derived from the off-diagonal covariance terms. Use this for operational calibration.

**Why there are two estimators:**

`fit_mixed_subjects_mml()` is the default estimator for every tuner _except_ `tune_lambda_ability_risk_item`. It is a consistent iterative MML-based estimator that marginalizes over the ability distribution. The implementation recomputes posteriors at every gradient evaluation, and its sandwich covariance (`vcov_mixed_subjects_mml`, called automatically via `vcov()`) uses Louis' observed-information correction for the bread, providing proper coverage.[^3]

The original `fit_mixed_subjects()` uses frozen expected-count posteriors and is still available by passing `fit_fn = fit_mixed_subjects`, but it is **highly discouraged**: the frozen posteriors create a gradient asymmetry when the LLM item parameters differ from human parameters, systematically inflating discriminations and driving `tune_lambda_ability_risk` to select $\lambda = 0$ even when the LLM is genuinely informative (it also requires a `slope_upper` cap for stability). Use it only as a fast approximation when computation time is a binding
constraint. Note that the current implementation of `tune_lambda_ability_risk_item` uses this estimation procedure and should be considered experimental.

## Example data

```{r data}
library(mixedsubjectsirt)

set.seed(242424)

n_human     <- 120
n_generated <- 350
n_items     <- 5

human_pars <- data.frame(
  item = paste0("Item", seq_len(n_items)),
  a    = c(0.8, 1.0, 1.2, 1.35, 1.55),
  d    = c(-0.8, -0.3, 0.1, 0.45, 0.9)
)
human_pars$b <- -human_pars$d / human_pars$a

llm_pars   <- human_pars
llm_pars$a <- pmax(0.35, 0.9 * human_pars$a)
llm_pars$d <- human_pars$d + 0.25
llm_pars$b <- -llm_pars$d / llm_pars$a

theta_human <- rnorm(n_human)
observed    <- simulate_2pl(theta_human, human_pars)
predicted   <- simulate_2pl(theta_human, llm_pars)
generated   <- simulate_2pl(rnorm(n_generated), llm_pars)

lambda_grid <- c(0, 0.25, 0.5, 0.75, 1)
```

The examples use `human_pars` as `initial_pars` for speed.

## Ability-risk tuning: Minimizing $\mathbb{E}[g'\Sigma_\gamma g]$

The key result from the [Linking and Gradient Asymmetry](linking-comparison.html) vignette is that when the LLM parameters differ from human parameters, the frozen expected-count estimator drives $\lambda \to 0$ due to an artificial gradient asymmetry. The default MML estimator removes this asymmetry, so with a good predictor ($F = Y$) `tune_lambda_ability_risk()` correctly selects $\lambda > 0$. Because `fit_mixed_subjects_mml()` is the default `fit_fn`, no estimator argument is needed.

By default `tune_lambda_ability_risk()` chooses $\lambda$ by direct optimization of the risk over `[0, 1]`. Here we pass `method = "grid"` with `lambda_grid` so the `summary` shows the risk at each candidate, which handy for visualizing the $\lambda$-risk surface. In practice you will likely omit both arguments and let it optimize $\lambda$ directly.

```{r ability-tuning-mml}
ability_tuned_mml <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_pars,
  method       = "grid",                # show the risk at each candidate value
  control      = list(maxit = 100)
)

ability_tuned_mml$summary[, c("lambda", "mean_param_var", "mean_total_risk",
                               "convergence", "selection_risk")]
ability_tuned_mml$best_lambda

tune_lambda_ability_risk(
  observed = observed, predicted = predicted, generated = generated,
  target_resp = observed, initial_pars = human_pars,
  control = list(maxit = 100)
)$best_lambda
```

`selection_risk` is `Inf` for any candidate with non-zero convergence code or
non-finite risk, protecting selection from numerical failures.

We can inspect the components of the ability risk:

```{r risk-components}

risk <- ability_risk(
  resp        = observed,
  fit_or_pars = ability_tuned_mml$best_fit,
  vcov        = vcov(ability_tuned_mml$best_fit)
)
risk$summary
```

`mean_squared_error` is `NA` because `theta_true` was not supplied; it is only computed in simulation studies. `mean_param_var` is the propagated item-parameter uncertainty. `mean_total_risk` is the ability risk target that the method optimizes for.

In simulation studies, pass `theta_true = theta_human` to also include squared ability-estimation error:

```{r risk-with-truth}
ability_tuned_truth <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  theta_true   = theta_human,
  initial_pars = human_pars,
  fit_fn       = fit_mixed_subjects_mml,
  method       = "grid",
  control      = list(maxit = 100)
)

ability_tuned_truth$summary[, c("lambda", "mean_param_var",
                                 "mean_squared_error", "mean_total_risk")]
```

## Cross-fit $\lambda$ tuning (recommended workflow)

`tune_lambda_ability_risk_crossfit()` is the recommended workflow for all serious analyses: it estimates $\lambda$ per fold on the *other* folds' labels, so the selected $\lambda$ is not tuned on the same rows it is applied to. By default both the fold tuning (`fit_fn`) and the final fit (`final_fit_fn`) use the MML estimator, and the fold $\lambda$'s are averaged (weighted by fold size) into a single scalar for the final full-sample fit. Two further arguments:

- `target_mode = "fixed"` (default): the full `target_resp` is used for every fold's risk evaluation, which is correct when the target is an operational scoring population independent of the calibration sample.
- `target_mode = "row_aligned"`: subsets `target_resp` to training rows, valid when `target_resp = observed`.

```{r crossfit}
split_id <- rep(1:2, length.out = nrow(observed))

crossfit_tuned <- tune_lambda_ability_risk_crossfit(
  lambda_grid  = c(0, 0.5, 1),
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  split_id     = split_id,
  initial_pars = human_pars,
  n_quad       = 7,
  control      = list(maxit = 100)
)

crossfit_tuned$lambda_by_split   # per-fold tuned lambda
crossfit_tuned$lambda_final      # fold-size-weighted scalar used for the final fit
```

## Frozen expected-count estimator (fast approximation)

Passing `fit_fn = fit_mixed_subjects` selects the older frozen expected-count estimator. It is faster than the default MML path but produces inflated discriminations when the LLM parameters differ from human parameters, driving $\lambda \to 0$ even when the LLM is informative. As such, it is **strongly discouraged**.

```{r frozen-ec}
ability_tuned_ec <- tune_lambda_ability_risk(
  lambda_grid  = lambda_grid,
  observed     = observed,
  predicted    = predicted,
  generated    = generated,
  target_resp  = observed,
  initial_pars = human_pars,
  fit_fn       = fit_mixed_subjects,   # opt in to the frozen EC estimator
  n_quad       = 7,
  slope_upper  = 4,           # required to prevent divergence
  control      = list(maxit = 100)
)

ability_tuned_ec$best_lambda
```

`slope_upper = 4` is required to prevent discriminations from diverging. The
default MML estimator does not need this cap because it has no false minimum at
large discrimination values.

## Minimizing $\text{Tr}\big[\Sigma_\gamma\big]$ (diagnostic only)

`tune_lambda_ppi_score()` estimates the PPI++ Proposition 2 lambda using the
same human posterior weights for both human and paired-LLM score vectors.
F = Y (identical predictions) gives exactly $N/(n+N)$.

```{r ppi-score}
ppi_score <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = predicted,
  item_pars   = human_pars,
  n_generated = nrow(generated)
)

cat("PPI++ score lambda:", round(ppi_score$lambda, 3),
    " r =", round(ppi_score$r, 3),
    " N/(n+N) =", round(1 / (1 + ppi_score$r), 3), "\n")
```

A value near zero means the paired LLM responses do not systematically reduce
gradient variance in the person-level score formulation. This is expected for
stochastic binary LLM responses. For $F = Y$ the formula recovers $N/(n+N)$.

```{r ppi-score-2}
ppi_score_match <- tune_lambda_ppi_score(
  observed    = observed,
  predicted   = observed,
  item_pars   = human_pars,
  n_generated = nrow(generated)
)

cat("PPI++ score lambda:", round(ppi_score_match$lambda, 3),
    " r =", round(ppi_score$r, 3),
    " N/(n+N) =", round(1 / (1 + ppi_score_match$r), 3), "\n")
```

## Choosing a procedure

| Procedure | Objective | When to use |
|-----------|-----------|-------------|
| `tune_lambda_ability_risk_crossfit()` | $\mathbb{E}[g'\Sigma_\gamma g]$, computed out-of-fold | **Recommended workflow** |
| `tune_lambda_ability_risk()` | $\mathbb{E}[g'\Sigma_\gamma g]$, Louis bread (MML default) | Exploratory single-sample tuning |
| `tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects)` | $\mathbb{E}[g'\Sigma_\gamma g]$, EM bread | Fast approximation; discouraged, requires `slope_upper` |
| `tune_lambda_ability_risk_item()` | Per-item risk (approx.) | Experimental; some items poor predictors |
| `tune_lambda_ppi_score()` | $\text{Tr}(\Sigma_\gamma)$ | Theoretical diagnostic only |

Also note that the target population matters, as ability risk is integrated over that population's ability distribution. `target_resp = observed` tunes for the observed human response patterns. In operational scoring, you may use a larger target matrix representing the actual scoring population while calibrating item parameters on a pilot sample.