When AI systems score essays, short-answer responses, or structured tasks, a critical fairness question arises: does the AI scoring engine shift item difficulties differently for different demographic groups?
Classical differential item functioning (DIF) methods test whether an item performs differently across groups within a single scoring condition. aiDIF extends this to a paired design: the same items are calibrated under both human and AI scoring, so that scoring-induced shifts in item difficulty can be compared across groups.

make_aidif_eg() returns a built-in example with item-parameter MLEs for 6 items in two groups under both scoring conditions. The example is constructed so that items 1, 3, and 5 carry DIF in both scoring conditions, and item 3 additionally carries a group-dependent AI scoring shift (the planted DASB effect).
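Loading the package and building the example object used throughout:

```r
library(aiDIF)

eg <- make_aidif_eg()
```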
fit_aidif() runs the robust IRLS (iteratively reweighted least squares) engine under each scoring condition and performs the differential AI scoring bias (DASB) test.
```r
mod <- fit_aidif(
  human_mle = eg$human,  # item-parameter MLEs under human scoring
  ai_mle    = eg$ai,     # item-parameter MLEs under AI scoring
  alpha     = 0.05       # flagging level for the Wald DIF tests
)
print(mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring — robust scale est: -0.5776 (SE: 0.0747)
#> — DIF items flagged: 3 / 6
#> AI scoring — robust scale est: -0.5921 (SE: 0.0748)
#> — DIF items flagged: 3 / 6
#> DASB test — items with differential AI bias: 1 / 6
```

```r
summary(mod)
#> =============================================================
#> AI Differential Item Functioning Analysis (aiDIF)
#> =============================================================
#>
#> --- Human Scoring DIF ----------------------------------------
#> Robust scale estimate: -0.5776 (SE: 0.0747)
#> Wald DIF tests:
#>             delta     se       z  p_val
#> item1_d1   0.5693 0.0759  7.4995 0.0000
#> item2_d1   0.0366 0.1060  0.3448 0.7303
#> item3_d1   0.2302 0.0623  3.6953 0.0002
#> item4_d1   0.0163 0.0931  0.1756 0.8606
#> item5_d1   0.2700 0.0693  3.8947 0.0001
#> item6_d1  -0.1181 0.1232 -0.9584 0.3379
#>
#> --- AI Scoring DIF -------------------------------------------
#> Robust scale estimate: -0.5921 (SE: 0.0748)
#> Wald DIF tests:
#>             delta     se       z  p_val
#> item1_d1   0.5756 0.0761  7.5596 0.0000
#> item2_d1   0.0466 0.1046  0.4458 0.6557
#> item3_d1   0.5499 0.0619  8.8820 0.0000
#> item4_d1   0.0046 0.0926  0.0495 0.9605
#> item5_d1   0.3308 0.0695  4.7559 0.0000
#> item6_d1  -0.1455 0.1240 -1.1737 0.2405
#>
#> --- Differential AI Scoring Bias (DASB) ---------------------
#> H0: AI scoring shift does not differ across groups
#> (Positive DASB => AI scoring disadvantages focal group)
#>
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303
#>
#> --- AI-Effect Classification ---------------------------------
#> stable_clean : not flagged in either condition
#> stable_dif : flagged in both (same direction)
#> introduced : flagged only under AI scoring
#> masked : flagged only under human scoring
#> new_direction : flagged in both, opposite direction
#>
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean
#>
#> Status counts:
#>
#> stable_clean   stable_dif
#>            3            3
```

scoring_bias_test() can also be called directly.
```r
sb <- scoring_bias_test(eg$human, eg$ai)
print(sb)
#>       shift_g1 shift_g2  DASB   se      z  p_val
#> item1     0.13     0.12 -0.01 0.14 -0.071 0.9431
#> item2     0.08     0.07 -0.01 0.14 -0.071 0.9431
#> item3     0.11     0.54  0.43 0.14  3.071 0.0021
#> item4     0.12     0.09 -0.03 0.14 -0.214 0.8303
#> item5     0.07     0.13  0.06 0.14  0.429 0.6682
#> item6     0.11     0.08 -0.03 0.14 -0.214 0.8303
```

Item 3 should be significant, reflecting the planted group-dependent AI scoring bias.
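The DASB column is the between-group contrast of the per-item AI-minus-human difficulty shifts, as the printed table suggests; it can be verified by hand for item 3:

```r
# shift_g = AI difficulty shift relative to human scoring, per group;
# DASB = shift_g2 - shift_g1 (per the output header, positive values
# mean AI scoring disadvantages the focal group).
shift_g1 <- 0.11  # item 3, group 1
shift_g2 <- 0.54  # item 3, group 2
shift_g2 - shift_g1
#> [1] 0.43
```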
```r
eff <- ai_effect_summary(mod$dif_human, mod$dif_ai)
print(eff)
#>          human_delta ai_delta human_flag ai_flag       status
#> item1_d1      0.5693   0.5756       TRUE    TRUE   stable_dif
#> item2_d1      0.0366   0.0466      FALSE   FALSE stable_clean
#> item3_d1      0.2302   0.5499       TRUE    TRUE   stable_dif
#> item4_d1      0.0163   0.0046      FALSE   FALSE stable_clean
#> item5_d1      0.2700   0.3308       TRUE    TRUE   stable_dif
#> item6_d1     -0.1181  -0.1455      FALSE   FALSE stable_clean
```

| Status | Meaning |
|---|---|
| introduced | AI scoring creates DIF not present under human scoring |
| masked | AI scoring hides DIF that existed under human scoring |
| stable_dif | DIF detected in both conditions |
| stable_clean | No DIF in either condition |
| new_direction | DIF detected in both conditions, but in opposite directions |
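For illustration, the classification rule in the table above can be written out directly. This is a minimal sketch of the logic, not the package's internal implementation:

```r
# Reimplementation of the status rule for illustration only (not aiDIF's
# internal code): combine the human/AI DIF flags with the sign agreement
# of the two delta estimates.
classify_ai_effect <- function(human_flag, ai_flag, human_delta, ai_delta) {
  same_dir <- sign(human_delta) == sign(ai_delta)
  ifelse(!human_flag & !ai_flag, "stable_clean",
  ifelse( human_flag & !ai_flag, "masked",
  ifelse(!human_flag &  ai_flag, "introduced",
  ifelse(same_dir,               "stable_dif", "new_direction"))))
}

classify_ai_effect(c(TRUE, FALSE), c(TRUE, TRUE),
                   human_delta = c(0.57, -0.10), ai_delta = c(0.58, 0.25))
#> [1] "stable_dif" "introduced"
```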
simulate_aidif_data() generates paired human/AI example data with planted DIF and DASB items, which makes it easy to exercise the full pipeline:

```r
dat <- simulate_aidif_data(
  n_items    = 8,
  n_obs      = 600,
  dif_items  = c(1, 2),  # items planted with ordinary DIF
  dif_mag    = 0.5,
  dasb_items = 5,        # item planted with differential AI scoring bias
  dasb_mag   = 0.4,
  seed       = 123
)
sim_mod <- fit_aidif(dat$human, dat$ai)
print(sim_mod)
#> AI-DIF Analysis
#> ----------------------------------------
#> Human scoring — robust scale est: -0.2670 (SE: 0.0322)
#> — DIF items flagged: 4 / 8
#> AI scoring — robust scale est: 0.0536 (SE: 0.0363)
#> — DIF items flagged: 5 / 8
#> DASB test — items with differential AI bias: 1 / 8
```
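As a quick sanity check, the flagged DASB items can be compared against the planted position (dasb_items = 5 above). A minimal sketch, assuming scoring_bias_test() returns a data frame with the p_val column and item row names shown earlier (an assumption about the return type):

```r
# Which items are flagged for differential AI scoring bias?
# Assumes the return value is a data frame shaped like the printed table.
sb_sim <- scoring_bias_test(dat$human, dat$ai)
rownames(sb_sim)[sb_sim$p_val < 0.05]  # should recover the planted item 5
```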