This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.
BORG classifies risks into two categories based on their impact on evaluation validity:
| Category | Impact | BORG Response |
|---|---|---|
| Hard Violation | Results are invalid | Blocks evaluation, requires fix |
| Soft Inflation | Results are biased | Warns, allows with caution |
## Hard Violations

These make your evaluation results invalid. Any metric computed in their presence is unreliable.
### index_overlap

**What**: Same row indices appear in both the training and test sets.

**Why it matters**: The model has seen the exact data it is being tested on. This is the most basic form of leakage.

**Detection**: Set intersection of `train_idx` and `test_idx`.
data <- data.frame(x = 1:100, y = rnorm(100))
# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 60 rows
#> Test indices: 50 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] index_overlap
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Source: train_idx/test_idx
#> Affected: 10 indices (first 5: 51, 52, 53, 54, 55)
#> Fix: Recreate train/test split with non-overlapping indices

**Fix**: Ensure indices are mutually exclusive. Use `setdiff()` to create non-overlapping sets.
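For instance, the broken split above can be repaired with `setdiff()` (a base-R sketch):

```r
data <- data.frame(x = 1:100, y = rnorm(100))
train_idx <- 1:60

# Give the test set exactly the rows the training set does not use
test_idx <- setdiff(seq_len(nrow(data)), train_idx)

length(intersect(train_idx, test_idx))  # 0: no shared indices
```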
### duplicate_rows

**What**: The test set contains rows identical to training rows.

**Why it matters**: The model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.

**Detection**: Row hashing and comparison (C++ backend for numeric data).
# Data with duplicate rows
dup_data <- rbind(
data.frame(x = 1:5, y = 1:5),
data.frame(x = 1:5, y = 1:5) # Duplicates
)
result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 5 rows
#> Test indices: 5 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] duplicate_rows
#> Test set contains 5 rows identical to training rows (memorization risk)
#> Source: data.frame
#> Affected: 6, 7, 8, 9, 10
#> Fix: Remove duplicate rows or ensure they fall within the same fold

**Fix**: Remove duplicate rows before splitting, or ensure splits respect duplicates (keep all copies in the same set).
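Both remedies can be sketched in base R: `duplicated()` handles the first, and splitting on distinct row values (rather than row indices) handles the second:

```r
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)  # exact duplicates
)

# Option 1: drop exact duplicates before splitting
deduped <- dup_data[!duplicated(dup_data), ]
nrow(deduped)  # 5

# Option 2: split on distinct row values so every copy of a row
# lands in the same set
row_key <- interaction(dup_data, drop = TRUE)
train_keys <- levels(row_key)[1:3]
train_idx <- which(row_key %in% train_keys)
test_idx  <- which(!row_key %in% train_keys)
```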
### preprocessing_leak

**What**: Normalization, imputation, or dimensionality reduction fitted on the full data before splitting.

**Why it matters**: Test-set statistics influenced the preprocessing parameters applied to the training data, so information flows backwards from test to train.

**Detection**: Recompute statistics on train-only data and compare them to the stored parameters; a discrepancy indicates leakage.

Supported objects:

| Object Type | Parameters Checked |
|---|---|
| `caret::preProcess` | `$mean`, `$std` |
| `recipes::recipe` | Step parameters after `prep()` |
| `prcomp` | `$center`, `$scale`, rotation matrix |
| `scale()` attributes | `center`, `scale` |
# BAD: Scale fitted on all data
scaled_data <- scale(data) # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]
# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)

**Fix**: Fit preprocessing on the training data only, then apply the fitted parameters to the test set.
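With base R's `scale()`, the leak-free version extracts the centering and scaling attributes from the training fit and reuses them on the test rows (a sketch; `recipes::prep()`/`bake()` and `caret::preProcess()` follow the same fit-on-train, apply-to-test pattern):

```r
data <- data.frame(x = rnorm(100), y = rnorm(100))

# GOOD: fit scaling parameters on the training rows only
train <- scale(data[1:70, ])
centers <- attr(train, "scaled:center")
scales  <- attr(train, "scaled:scale")

# Apply the *training* parameters to the test rows
test <- scale(data[71:100, ], center = centers, scale = scales)
```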
### target_leakage_direct

**What**: A feature has absolute correlation > 0.99 with the target.

**Why it matters**: The feature is almost certainly derived from the outcome. Examples:

- `days_since_diagnosis` when predicting `has_disease`
- `total_spent` when predicting `is_customer`
- Aggregated future values leaked into current features

**Detection**: Compute the Pearson correlation of each numeric feature with the target on the training data.
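The scan itself is a one-liner per feature with `cor()`; a minimal sketch of the idea (not BORG's internal code):

```r
set.seed(1)
leaky <- data.frame(x = rnorm(100), outcome = rnorm(100))
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01)

train_rows <- 1:70
features <- setdiff(names(leaky), "outcome")
cors <- vapply(features, function(f) {
  abs(cor(leaky[train_rows, f], leaky[train_rows, "outcome"]))
}, numeric(1))

names(cors)[cors > 0.99]  # "leaked"
```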
# Simulate target leakage
leaky <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01) # Near-perfect correlation
result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] target_leakage_direct
#> Feature 'leaked' has correlation 1.000 with target 'outcome'. Likely derived from outcome.
#> Source: data.frame$leaked
#> Fix: Remove features derived from the target variable

**Fix**: Remove or investigate the leaky feature. If it is a legitimate predictor, document why correlation > 0.99 is expected.
### group_leakage

**What**: The same group (patient, site, species) appears in both train and test.

**Why it matters**: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won’t exist for new patients.

**Detection**: Set intersection of group-membership values.
# Clinical data with patient IDs
clinical <- data.frame(
patient_id = rep(1:10, each = 10),
measurement = rnorm(100)
)
# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]
result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
groups = "patient_id")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> No risks detected.

**Fix**: Use group-aware splitting.
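A group-aware split samples patients, not rows; a base-R sketch:

```r
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

set.seed(123)
# Sample whole patients into the training set, then take their rows
train_patients <- sample(unique(clinical$patient_id), 7)
train_idx <- which(clinical$patient_id %in% train_patients)
test_idx  <- which(!clinical$patient_id %in% train_patients)

# No patient appears on both sides of the split
intersect(clinical$patient_id[train_idx],
          clinical$patient_id[test_idx])  # integer(0)
```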
### temporal_leak

**What**: Test observations predate training observations.

**Why it matters**: The model uses future information to predict the past. In deployment, future data won’t be available.

**Detection**: Compare the maximum training timestamp to the minimum test timestamp.
# Time series data
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
value = cumsum(rnorm(100))
)
# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]
result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
time = "date")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> No risks detected.

**Fix**: Use chronological splits where all test data comes after the training data.
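A chronological split orders by the time column and cuts once; a base-R sketch:

```r
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Order by time, then cut: the earliest 70% trains, the rest tests
ord <- order(ts_data$date)
train_idx <- ord[1:70]
test_idx  <- ord[71:100]

max(ts_data$date[train_idx]) < min(ts_data$date[test_idx])  # TRUE
```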
### cv_contamination

**What**: Cross-validation folds contain test indices, or folds overlap incorrectly.

**Why it matters**: Nested CV requires the outer test set to be completely held out from all inner training.

**Detection**: Check whether any fold’s training indices intersect the held-out test set.

Supported objects:

- `caret::trainControl` - checks `$index` and `$indexOut`
- `rsample::vfold_cv` and other `rset` objects
- `rsample::rsplit` objects
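Stripped of object-specific accessors, the check reduces to intersecting each fold's training indices with the held-out test set; a sketch using a plain list of hypothetical folds (not a caret or rsample object):

```r
test_idx <- 81:100

# Hypothetical inner-CV folds: each element holds one fold's training indices
folds <- list(
  fold1 = c(1:40, 85:90),  # contaminated: reuses held-out rows
  fold2 = 41:80            # clean
)

contaminated <- vapply(folds, function(f) {
  length(intersect(f, test_idx)) > 0
}, logical(1))

names(contaminated)[contaminated]  # "fold1"
```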
### model_scope

**What**: The model was trained on more rows than the claimed training set.

**Why it matters**: The model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on the full data).

**Detection**: Compare `nrow(trainingData)` or `length(fitted.values)` to `length(train_idx)`.

Supported objects: `lm`, `glm`, `ranger`, `caret::train`, parsnip models, workflows.
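For a plain `lm` fit, the comparison is direct (a sketch of the idea, not BORG internals):

```r
data <- data.frame(x = rnorm(100), y = rnorm(100))
train_idx <- 1:70

fit_good <- lm(y ~ x, data = data[train_idx, ])
fit_bad  <- lm(y ~ x, data = data)  # accidentally trained on every row

length(fitted(fit_good)) == length(train_idx)  # TRUE: scope respected
length(fitted(fit_bad))  == length(train_idx)  # FALSE: model saw 30 extra rows
```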
## Soft Inflations

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.
### target_leakage_proxy

**What**: A feature has absolute correlation between 0.95 and 0.99 with the target.

**Why warning, not error**: It may be a legitimate strong predictor; domain knowledge is needed to judge.

**Detection**: Same as direct leakage, with a lower threshold.
# Strong but not extreme correlation
proxy <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3) # r ~ 0.96
result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-29 10:09:29
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] target_leakage_proxy
#> Feature 'strong_predictor' has correlation 0.959 with target 'outcome'. May be a proxy for outcome.
#> Source: data.frame$strong_predictor
#> Fix: Review evaluation workflow for potential information reuse

**Action**: Review whether the feature should be available at prediction time in production.
### spatial_proximity

**What**: Test points are very close to training points in geographic space.

**Why it matters**: Spatial autocorrelation means nearby points share variance. The model learns local patterns that don’t generalize to distant locations.

**Detection**: Compute the minimum distance from each test point to its nearest training point; flag it if it is < 1% of the spatial spread.
set.seed(42)
spatial <- data.frame(
lon = runif(100, 0, 100),
lat = runif(100, 0, 100),
value = rnorm(100)
)
# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)
result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
coords = c("lon", "lat"))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-29 10:09:30
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] spatial_overlap
#> 93% of test points fall within the training region convex hull. Consider spatial blocking.
#> Source: data.frame
#> Fix: Review evaluation workflow for potential information reuse

**Fix**: Use spatial blocking.
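A minimal grid-based blocking sketch in base R (dedicated packages such as blockCV offer more principled block construction; the 2 x 2 grid here is arbitrary):

```r
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Assign each point to a cell of a 2 x 2 spatial grid
block <- interaction(
  cut(spatial$lon, breaks = 2),
  cut(spatial$lat, breaks = 2),
  drop = TRUE
)

# Hold out one whole block as the test region
test_idx  <- which(block == levels(block)[1])
train_idx <- which(block != levels(block)[1])
```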
### spatial_overlap

**What**: The test region falls inside the training region’s convex hull.

**Why it matters**: Interpolation is easier than extrapolation. Performance on “surrounded” test points overestimates performance on truly new regions.

**Detection**: Compute the convex hull of the training points and count the test points inside it.

**Threshold**: Warn if > 50% of test points fall inside the training hull.
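The hull check can be sketched with `grDevices::chull()` plus a small ray-casting point-in-polygon helper (both the helper and the synthetic data below are illustrative, not BORG's code):

```r
# Training points on a regular grid; their convex hull is the full square
train <- expand.grid(x = seq(0, 100, by = 20), y = seq(0, 100, by = 20))
hull  <- train[chull(train$x, train$y), ]

set.seed(42)
test <- data.frame(x = runif(30, 20, 80), y = runif(30, 20, 80))

# Ray-casting test: is point (px, py) inside the polygon (hx, hy)?
in_hull <- function(px, py, hx, hy) {
  n <- length(hx); inside <- FALSE; j <- n
  for (i in seq_len(n)) {
    if (((hy[i] > py) != (hy[j] > py)) &&
        (px < (hx[j] - hx[i]) * (py - hy[i]) / (hy[j] - hy[i]) + hx[i]))
      inside <- !inside
    j <- i
  }
  inside
}

inside <- mapply(in_hull, test$x, test$y,
                 MoreArgs = list(hx = hull$x, hy = hull$y))
mean(inside)  # 1: every test point lies inside the training hull
```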
### Wrong CV Strategy

**What**: Using random k-fold CV when the data has spatial, temporal, or group structure.

**Why it matters**: Random folds break dependencies artificially, leading to optimistic error estimates.
# Diagnose data dependencies
spatial <- data.frame(
lon = runif(200, 0, 100),
lat = runif(200, 0, 100),
response = rnorm(200)
)
diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
verbose = FALSE)
diagnosis@recommended_cv
#> [1] "random"

**Fix**: Use `borg()` to generate appropriate blocked CV folds.
## Summary

| Risk Type | Severity | Detection Method | Fix |
|---|---|---|---|
| `index_overlap` | Hard | Index intersection | Use `setdiff()` |
| `duplicate_rows` | Hard | Row hashing | Deduplicate or group |
| `preprocessing_leak` | Hard | Parameter comparison | Fit on train only |
| `target_leakage` | Hard | Correlation > 0.99 | Remove feature |
| `group_leakage` | Hard | Group intersection | Group-aware split |
| `temporal_leak` | Hard | Timestamp comparison | Chronological split |
| `cv_contamination` | Hard | Fold index check | Rebuild folds |
| `model_scope` | Hard | Row count | Refit on train only |
| `proxy_leakage` | Soft | Correlation 0.95-0.99 | Domain review |
| `spatial_proximity` | Soft | Distance check | Spatial blocking |
| `spatial_overlap` | Soft | Convex hull | Geographic split |
## Working with BorgRisk Objects

# Create a result with violations
result <- borg_inspect(
data.frame(x = 1:100, y = rnorm(100)),
train_idx = 1:60,
test_idx = 51:100
)
# Summary
cat("Valid:", result@is_valid, "\n")
#> Valid: FALSE
cat("Hard violations:", result@n_hard, "\n")
#> Hard violations: 1
cat("Soft warnings:", result@n_soft, "\n")
#> Soft warnings: 0
# Individual risks
for (risk in result@risks) {
cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
cat(" ", risk$description, "\n")
if (!is.null(risk$affected)) {
cat(" Affected:", head(risk$affected, 5), "...\n")
}
}
#>
#> index_overlap(hard_violation):
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Affected: 51 52 53 54 55 ...
# Tabular format
as.data.frame(result)
#> type severity
#> 1 index_overlap hard_violation
#> description
#> 1 Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> source_object n_affected
#> 1 train_idx/test_idx 10
#> suggested_fix
#> 1 Recreate train/test split with non-overlapping indices

## See Also

- `vignette("quickstart")` - Basic usage
- `vignette("frameworks")` - Framework integration