autoFlagR is an R package for automated data quality
auditing using unsupervised machine learning. It provides AI-driven
anomaly detection for data quality assessment, primarily designed for
Electronic Health Records (EHR) data, with benchmarking capabilities for
validation and publication.
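The typical workflow consists of three main steps: preparing the data with prep_for_anomaly(), scoring records with score_anomaly(), and flagging the highest-scoring records with flag_top_anomalies().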
The prep_for_anomaly() function automatically handles:
- Identifier columns (patient_id, encounter_id, etc.)
- Missing value imputation
- Numerical feature scaling (MAD or min-max)
- Categorical variable encoding (one-hot)
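To make the preprocessing steps listed above concrete, here is a minimal base R sketch of what each one typically involves. It is not the package's internal implementation: the toy data, the helper names (impute_median, scale_mad, scale_minmax), and the choice of median imputation are assumptions made for illustration only.

```r
# Hypothetical toy data; column names and values are illustrative only.
toy <- data.frame(
  patient_id = 1:6,
  age        = c(34, 51, NA, 47, 62, 200),
  gender     = c("M", "F", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# 1. Set identifier columns aside so they do not influence the scores.
id_cols  <- "patient_id"
features <- toy[, setdiff(names(toy), id_cols), drop = FALSE]

# 2. Impute missing numeric values with the column median
#    (an assumed strategy, chosen here because it is robust to outliers).
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# 3a. Robust scaling: centre on the median and divide by the MAD.
#     Note: this divides by zero if the MAD is 0 (e.g. a near-constant column).
scale_mad <- function(x) {
  (x - median(x)) / mad(x)
}

# 3b. Alternative: min-max scaling to the interval [0, 1].
scale_minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

age_imputed <- impute_median(features$age)
age_mad     <- scale_mad(age_imputed)
age_minmax  <- scale_minmax(age_imputed)

# 4. One-hot encode a categorical column: one 0/1 column per level.
gender_onehot <- model.matrix(~ gender - 1, data = features)
```

MAD scaling is usually the safer choice when the data already contain extreme values, such as the age of 200 above, because the median and MAD are barely affected by them, whereas min-max scaling compresses all ordinary values toward zero.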
library(autoFlagR)

# Example healthcare data
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)
# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20 # Unusually high costs
data$age[6:8] <- c(200, 180, 190) # Impossible ages
# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")

Use either Isolation Forest (default) or Local Outlier Factor (LOF):
# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data,
  method = "iforest",
  contamination = 0.05
)
#> Warning in (function (data, sample_size = min(nrow(data), 10000L), ntrees =
#> 500, : Attempting to use more than 1 thread, but package was compiled without
#> OpenMP support. See
#> https://github.com/david-cortes/installing-optimized-libraries#4-macos-install-and-enable-openmp
# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
#>    patient_id anomaly_score
#> 1           1    0.15034167
#> 2           2    0.21395292
#> 3           3    0.00000000
#> 4           4    0.02693202
#> 5           5    0.23670251
#> 6           6    0.04638215
#> 7           7    0.11533699
#> 8           8    0.15881136
#> 9           9    0.92531753
#> 10         10    0.71809012

Flag records as anomalous based on threshold or contamination rate:
# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data,
  contamination = 0.05
)
# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
#>     patient_id anomaly_score is_anomaly
#> 39          39     0.9697503       TRUE
#> 56          56     0.9862881       TRUE
#> 63          63     0.9727825       TRUE
#> 73          73     0.9998179       TRUE
#> 135        135     0.9830231       TRUE
#> 157        157     1.0000000       TRUE
#> 175        175     0.9912094       TRUE
#> 184        184     0.9810962       TRUE
#> 191        191     0.9733082       TRUE
#> 192        192     0.9776592       TRUE
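
The two flagging modes mentioned above (a contamination rate or a fixed threshold) can be illustrated in base R. This is a minimal sketch, not autoFlagR's implementation: the scores vector and the helper flag_by_contamination() are hypothetical, and it assumes that higher scores mean more anomalous records and that the contamination rate is the fraction of records to flag.

```r
# Hypothetical anomaly scores in [0, 1]; higher means more anomalous.
scores <- c(0.15, 0.21, 0.00, 0.03, 0.24, 0.05, 0.12, 0.16, 0.93, 0.72,
            0.97, 0.10, 0.08, 0.31, 0.44, 0.02, 0.19, 0.27, 0.99, 0.06)

# Flag roughly the top `contamination` fraction of records.
flag_by_contamination <- function(scores, contamination = 0.05) {
  stopifnot(contamination > 0, contamination < 1)
  # Number of records to flag (at least one).
  n_flag <- max(1L, ceiling(length(scores) * contamination))
  # The cut-off is the n_flag-th largest score; ties at the cut-off
  # are all flagged, so slightly more than n_flag records may be returned.
  threshold <- sort(scores, decreasing = TRUE)[n_flag]
  scores >= threshold
}

# Contamination-based flagging: top 10% of 20 records, i.e. 2 records.
is_anomaly <- flag_by_contamination(scores, contamination = 0.10)
which(is_anomaly)

# Threshold-based flagging: every record scoring at least 0.9.
is_anomaly_fixed <- scores >= 0.9
which(is_anomaly_fixed)
```

A contamination rate always flags a fixed share of the data, even when nothing is truly unusual, while a fixed threshold can flag anywhere from none to all of the records. Which behaviour is preferable depends on whether the expected error rate or the score scale is better understood.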