
Introduction to aiDIF: Detecting Differential Item Functioning in AI-Scored Assessments

Background

When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties differently for different demographic groups?

Classical DIF methods test whether an item performs differently across groups within a single scoring condition. aiDIF extends this to a paired design:

  1. Human-scoring DIF — robust M-estimation of item-level bias
  2. AI-scoring DIF — the same analysis applied to AI-scored data
  3. Differential AI Scoring Bias (DASB) — a new test for group-dependent parameter shifts from human to AI scoring
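
The DASB quantity in step 3 can be sketched as a difference-in-differences of item difficulties: compute the human-to-AI shift separately for each group, then take the difference of those shifts. A minimal base-R illustration with made-up numbers (the difficulty values are invented, not package output):

```r
# Sketch of the DASB idea (not the package internals): for one item,
# DASB is the difference-in-differences of difficulty estimates.
d_human <- c(g1 = 1.20, g2 = 1.31)   # hypothetical human-scored difficulties
d_ai    <- c(g1 = 1.31, g2 = 1.85)   # hypothetical AI-scored difficulties

shift <- d_ai - d_human                      # per-group human-to-AI shift
dasb  <- unname(shift["g2"] - shift["g1"])   # group-dependent part
dasb
#> [1] 0.43
```

A DASB near zero means AI scoring shifted the item equally for both groups; a nonzero DASB means the shift itself depends on group membership, which is the fairness concern this test targets.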

The Example Dataset

make_aidif_eg() returns a built-in example containing item-parameter MLEs for 6 items in two groups under both scoring conditions, with a group-dependent AI scoring bias planted in item 3. The object's structure is:

eg <- make_aidif_eg()
str(eg, max.level = 2)
#> List of 2
#>  $ human:List of 3
#>   ..$ par.names:List of 2
#>   ..$ est      :List of 2
#>   ..$ var.cov  :List of 2
#>  $ ai   :List of 3
#>   ..$ par.names:List of 2
#>   ..$ est      :List of 2
#>   ..$ var.cov  :List of 2
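
Based only on the str() output above, each scoring condition appears to hold per-group parameter names, estimates, and variance-covariance matrices. A hypothetical sketch of how such an input might be assembled by hand, in case you want to supply your own MLEs (all names and values here are invented for illustration):

```r
# Hypothetical input for one scoring condition, mirroring the structure
# shown by str(): two groups, each with parameter names, point
# estimates, and a variance-covariance matrix. Values are made up.
human <- list(
  par.names = list(g1 = c("item1_d1", "item2_d1"),
                   g2 = c("item1_d1", "item2_d1")),
  est       = list(g1 = c(1.20, -0.35),
                   g2 = c(1.31, -0.30)),
  var.cov   = list(g1 = diag(0.01, 2),
                   g2 = diag(0.01, 2))
)
str(human, max.level = 1)
```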

Fitting the Model

fit_aidif() runs the robust IRLS engine under each scoring condition and performs the DASB test.

mod <- fit_aidif(
  human_mle = eg$human,
  ai_mle    = eg$ai,
  alpha     = 0.05
)
print(mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring  — robust scale est: -0.5776  (SE: 0.0747)
#>                — DIF items flagged: 3 / 6
#> AI scoring     — robust scale est: -0.5921  (SE: 0.0748)
#>                — DIF items flagged: 3 / 6
#> DASB test      — items with differential AI bias: 1 / 6

Full Report

summary(mod)
#> =============================================================
#>  AI Differential Item Functioning Analysis (aiDIF)
#> =============================================================
#> 
#> --- Human Scoring DIF ----------------------------------------
#>   Robust scale estimate:  -0.5776  (SE: 0.0747)
#>   Wald DIF tests:
#>            delta     se       z  p_val
#> item1_d1  0.5693 0.0759  7.4995 0.0000
#> item2_d1  0.0366 0.1060  0.3448 0.7303
#> item3_d1  0.2302 0.0623  3.6953 0.0002
#> item4_d1  0.0163 0.0931  0.1756 0.8606
#> item5_d1  0.2700 0.0693  3.8947 0.0001
#> item6_d1 -0.1181 0.1232 -0.9584 0.3379
#> 
#> --- AI Scoring DIF -------------------------------------------
#>   Robust scale estimate:  -0.5921  (SE: 0.0748)
#>   Wald DIF tests:
#>            delta     se       z  p_val
#> item1_d1  0.5756 0.0761  7.5596 0.0000
#> item2_d1  0.0466 0.1046  0.4458 0.6557
#> item3_d1  0.5499 0.0619  8.8820 0.0000
#> item4_d1  0.0046 0.0926  0.0495 0.9605
#> item5_d1  0.3308 0.0695  4.7559 0.0000
#> item6_d1 -0.1455 0.1240 -1.1737 0.2405
#> 
#> --- Differential AI Scoring Bias (DASB) ---------------------
#>   H0: AI scoring shift does not differ across groups
#>   (Positive DASB => AI scoring disadvantages focal group)
#> 
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303
#> 
#> --- AI-Effect Classification ---------------------------------
#>   stable_clean  : not flagged in either condition
#>   stable_dif    : flagged in both (same direction)
#>   introduced    : flagged only under AI scoring
#>   masked        : flagged only under human scoring
#>   new_direction : flagged in both, opposite direction
#> 
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean
#> 
#>   Status counts:
#> 
#> stable_clean   stable_dif 
#>            3            3

The DASB Test

scoring_bias_test() can also be called directly.

sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303

Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
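
The item 3 row can be verified by hand from the printed shifts: DASB is shift_g2 - shift_g1, and the Wald z divides it by the reported standard error, with a two-sided normal p-value:

```r
# Checking the printed item 3 line by hand (numbers taken from the
# output above).
dasb <- 0.54 - 0.11           # shift_g2 - shift_g1
se   <- 0.14
z    <- dasb / se             # approx 3.07
p    <- 2 * pnorm(-abs(z))    # approx 0.0021, matching the table row
```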

AI-Effect Classification

eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean

  Status         Meaning
  introduced     AI scoring creates DIF not present under human scoring
  masked         AI scoring hides DIF that existed under human scoring
  new_direction  DIF flagged in both conditions, but in opposite directions
  stable_dif     DIF detected in both conditions (same direction)
  stable_clean   No DIF in either condition
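
The status rules above can be expressed as a small helper. This is a sketch of the decision logic only, not the package's ai_effect_summary() implementation:

```r
# Classify one item's AI effect from its flags and DIF estimates,
# following the legend above.
classify_ai_effect <- function(human_flag, ai_flag, human_delta, ai_delta) {
  if (!human_flag && !ai_flag) return("stable_clean")
  if (human_flag && !ai_flag)  return("masked")
  if (!human_flag && ai_flag)  return("introduced")
  # flagged in both: compare the sign of the DIF estimates
  if (sign(human_delta) == sign(ai_delta)) "stable_dif" else "new_direction"
}

classify_ai_effect(TRUE,  TRUE,  0.5693, 0.5756)   # "stable_dif"
classify_ai_effect(FALSE, FALSE, 0.0366, 0.0466)   # "stable_clean"
```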

Visualisations

plot(mod, type = "dif_forest")   # human vs AI DIF side by side
plot(mod, type = "dasb")         # DASB bar chart with error bars
plot(mod, type = "weights")      # bi-square anchor weights
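
The third plot shows the bi-square anchor weights used by the robust engine. For reference, the standard Tukey bi-square weight function looks like the sketch below; whether aiDIF uses the common tuning constant c = 4.685 (roughly 95% Gaussian efficiency) is an assumption here:

```r
# Tukey bi-square weight: smooth downweighting inside |u| <= c,
# zero weight (full rejection) outside. Anchors with large residuals
# therefore contribute nothing to the robust scale estimate.
bisquare_w <- function(u, c = 4.685) {
  ifelse(abs(u) <= c, (1 - (u / c)^2)^2, 0)
}

bisquare_w(c(0, 2, 6))   # weight 1 at u = 0, partial at 2, zero beyond c
```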

Simulation

dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),
  dif_mag    = 0.5,
  dasb_items = 5,
  dasb_mag   = 0.4,
  seed       = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring  — robust scale est: -0.2670  (SE: 0.0322)
#>                — DIF items flagged: 4 / 8
#> AI scoring     — robust scale est: 0.0536  (SE: 0.0363)
#>                — DIF items flagged: 5 / 8
#> DASB test      — items with differential AI bias: 1 / 8
