Data frames collected from surveys, administrative records, or manual
entry almost always contain semantic inconsistencies that rule-based
tools cannot catch. The llmclean package addresses this by
sending a compact representation of your data to a large language model
(LLM) and requesting structured, human-readable suggestions for each
issue found.
The key distinction from existing tools:
| Tool | What it catches |
|---|---|
| janitor | Column name formatting |
| validate / pointblank | Rule-based type/range checks |
| llmclean | Semantic meaning: typos, abbreviations, case, malformed formats, cross-field contradictions |
No existing CRAN package provides the detect → explain → suggest → apply workflow specialised for semantic data frame inconsistencies.
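To make the distinction concrete, here is a hypothetical illustration (not package code): a rule-based check that validates only type and non-emptiness accepts a typo that is semantically wrong.

```r
# A rule-based check (value must be a non-empty string) accepts the typo
# 'actve', while a semantic check against the intended vocabulary does not.
status <- c("active", "actve", "inactive")

rule_ok     <- nchar(status) > 0                    # rule-based: all TRUE
semantic_ok <- status %in% c("active", "inactive")  # semantic: one FALSE

data.frame(status, rule_ok, semantic_ok)
#>     status rule_ok semantic_ok
#> 1   active    TRUE        TRUE
#> 2    actve    TRUE       FALSE
#> 3 inactive    TRUE        TRUE
```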
| Provider | Models | Notes |
|---|---|---|
| openai | GPT-4o, GPT-4o-mini | API key required |
| anthropic | Claude Haiku / Sonnet | API key required |
| google | Gemini 2.0 Flash | free tier available |
| groq | LLaMA 3.1, Mixtral | free tier available |
| ollama | Any local model | no key, fully offline |
| offline | Statistical fallback | no API, no internet |
library(llmclean)
library(dplyr)

For this vignette we use offline mode, which requires no API key:

set_llm_provider("offline")

With a real LLM provider (not run here):
# Free Groq tier — fastest inference
set_llm_provider("groq",
api_key = Sys.getenv("GROQ_API_KEY"),
model = "llama-3.1-8b-instant")
# OpenAI
set_llm_provider("openai",
api_key = Sys.getenv("OPENAI_API_KEY"),
model = "gpt-4o-mini")
# Anthropic Claude
set_llm_provider("anthropic",
api_key = Sys.getenv("ANTHROPIC_API_KEY"),
model = "claude-haiku-4-5-20251001")
# Local Ollama (no key needed, model must be installed)
set_llm_provider("ollama", model = "llama3")

data(messy_employees)
data(messy_survey)
cat("messy_employees:", nrow(messy_employees), "rows x",
ncol(messy_employees), "cols\n\n")
#> messy_employees: 20 rows x 8 cols
# Peek at known issues
cat("Status variants:\n"); print(table(messy_employees$status))
#> Status variants:
#>
#> ACTIVE Active INACTIVE Inactive active actve inactive
#> 2 5 1 1 5 2 4
cat("\nDepartment variants:\n"); print(table(messy_employees$department))
#>
#> Department variants:
#>
#> Finanace Finance HR Human Resources I.T.
#> 1 4 3 1 1
#> IT MARKETING Marketing finance hr
#> 5 1 1 1 1
#> marketing
#> 1
cat("\nAge outliers:", messy_employees$age[messy_employees$age < 0 |
messy_employees$age > 100], "\n")
#>
#> Age outliers: -5 150

The data contains at least six types of inconsistency across eight columns — typical of manually entered HR records.
detect_issues() calls the LLM (or offline fallback) and
returns a tidy tibble with one row per detected problem.
issues <- detect_issues(messy_employees,
context = "HR employee records. Status values
should be 'active' or 'inactive'.")
cat("Issues found:", nrow(issues), "\n\n")
#> Issues found: 103
print(issues[, c("column","row_index","value","issue_type",
"suggestion","confidence")])
#> # A tibble: 103 × 6
#> column row_index value issue_type suggestion confidence
#> <chr> <int> <chr> <chr> <chr> <dbl>
#> 1 age 10 150 outlier REVIEW 0.75
#> 2 department 1 Finance format REVIEW 0.8
#> 3 department 2 finance case Finance 0.9
#> 4 department 2 finance format REVIEW 0.8
#> 5 department 3 Finance format REVIEW 0.8
#> 6 department 4 HR typo it 0.76
#> 7 department 4 HR format REVIEW 0.8
#> 8 department 5 hr case HR 0.9
#> 9 department 5 hr typo it 0.76
#> 10 department 5 hr format REVIEW 0.8
#> # ℹ 93 more rows

# Summary by type
as.data.frame(table(Type = issues$issue_type)) |>
dplyr::arrange(dplyr::desc(Freq))
#> Type Freq
#> 1 format 61
#> 2 case 20
#> 3 typo 13
#> 4 abbreviation 7
#> 5 outlier 2

# Show all case inconsistencies found
issues[issues$issue_type == "case",
c("column","row_index","value","suggestion","confidence")]
#> # A tibble: 20 × 5
#> column row_index value suggestion confidence
#> <chr> <int> <chr> <chr> <dbl>
#> 1 department 2 finance Finance 0.9
#> 2 department 5 hr HR 0.9
#> 3 department 10 Marketing MARKETING 0.9
#> 4 department 11 marketing MARKETING 0.9
#> 5 name 3 alice johnson Alice Johnson 0.9
#> 6 status 1 Active ACTIVE 0.9
#> 7 status 2 active ACTIVE 0.9
#> 8 status 4 active ACTIVE 0.9
#> 9 status 5 Inactive INACTIVE 0.9
#> 10 status 6 inactive INACTIVE 0.9
#> 11 status 8 Active ACTIVE 0.9
#> 12 status 10 active ACTIVE 0.9
#> 13 status 11 Active ACTIVE 0.9
#> 14 status 12 inactive INACTIVE 0.9
#> 15 status 14 active ACTIVE 0.9
#> 16 status 15 Active ACTIVE 0.9
#> 17 status 16 inactive INACTIVE 0.9
#> 18 status 17 inactive INACTIVE 0.9
#> 19 status 18 Active ACTIVE 0.9
#> 20 status 20 active ACTIVE 0.9

The offline detector uses utils::adist() to compute
Levenshtein edit distances between all pairs of unique values in each
column. Values within 2 edits of a more frequent value are flagged as
typos.
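As a standalone sketch of the offline heuristics (an illustration, not the package's internal implementation), the edit-distance rule can be paired with the Tukey outer-fence rule used for the outlier issues shown later in this vignette:

```r
# 1. Typo flagging via Levenshtein distance (utils::adist): values within
#    2 edits of a more frequent value are candidate typos.
vals <- c("active", "active", "active", "actve", "inactive")
freq <- sort(table(vals), decreasing = TRUE)
d    <- utils::adist("actve", names(freq))
names(freq)[d > 0 & d <= 2]
#> [1] "active"

# 2. Outlier flagging via Tukey's outer fences (Q1 - 3*IQR, Q3 + 3*IQR).
x <- c(25, 31, 28, 45, 38, 150)
q <- quantile(x, c(0.25, 0.75))
x[x < q[1] - 3 * IQR(x) | x > q[2] + 3 * IQR(x)]
#> [1] 150
```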
issues[issues$issue_type == "typo",
c("column","row_index","value","suggestion","explanation")]
#> # A tibble: 13 × 5
#> column row_index value suggestion explanation
#> <chr> <int> <chr> <chr> <chr>
#> 1 department 4 HR it 'HR' is 2 edit(s) f…
#> 2 department 5 hr it 'hr' is 2 edit(s) f…
#> 3 department 15 Finanace finance 'Finanace' is 1 edi…
#> 4 department 19 HR it 'HR' is 2 edit(s) f…
#> 5 department 20 HR it 'HR' is 2 edit(s) f…
#> 6 email 7 fiona@company.com nina@company.com 'fiona@company.com'…
#> 7 email 10 ian@company.com sam@company.com 'ian@company.com' i…
#> 8 email 13 laura@company.com paula@company.com 'laura@company.com'…
#> 9 hire_date 8 2021-01-18 2021-06-14 '2021-01-18' is 2 e…
#> 10 hire_date 13 2011-07-15 2018-03-15 '2011-07-15' is 2 e…
#> 11 hire_date 20 2018-09-25 2018-03-15 '2018-09-25' is 2 e…
#> 12 status 9 actve active 'actve' is 1 edit(s…
#> 13 status 19 actve active 'actve' is 1 edit(s…

issues[issues$issue_type == "format",
c("column","row_index","value","suggestion")]
#> # A tibble: 61 × 4
#> column row_index value suggestion
#> <chr> <int> <chr> <chr>
#> 1 department 1 Finance REVIEW
#> 2 department 2 finance REVIEW
#> 3 department 3 Finance REVIEW
#> 4 department 4 HR REVIEW
#> 5 department 5 hr REVIEW
#> 6 department 7 IT REVIEW
#> 7 department 8 I.T. REVIEW
#> 8 department 9 IT REVIEW
#> 9 department 10 Marketing REVIEW
#> 10 department 11 marketing REVIEW
#> # ℹ 51 more rows

issues[issues$issue_type == "outlier",
c("column","row_index","value","explanation")]
#> # A tibble: 2 × 4
#> column row_index value explanation
#> <chr> <int> <chr> <chr>
#> 1 age 10 150 Value 150 is outside the Tukey outer fence [-7, 77] (…
#> 2 salary 10 999999 Value 999999 is outside the Tukey outer fence [2500, …

suggest_fixes() can enrich low-confidence suggestions by
re-querying the LLM with surrounding row context. In offline mode it
returns the issues unchanged.
enriched <- suggest_fixes(messy_employees, issues)
cat("Enriched columns:", paste(names(enriched), collapse = ", "), "\n")
#> Enriched columns: column, row_index, value, issue_type, explanation, suggestion, confidence, provider, model, alternatives, confidence_revised
# Show suggestions for status column
enriched[enriched$column == "status",
c("row_index","value","suggestion","alternatives","confidence_revised")]
#> # A tibble: 43 × 5
#> row_index value suggestion alternatives confidence_revised
#> <int> <chr> <chr> <chr> <dbl>
#> 1 1 Active ACTIVE ACTIVE 0.9
#> 2 1 Active REVIEW REVIEW 0.8
#> 3 2 active ACTIVE ACTIVE 0.9
#> 4 2 active REVIEW REVIEW 0.8
#> 5 3 ACTIVE REVIEW REVIEW 0.8
#> 6 4 active ACTIVE ACTIVE 0.9
#> 7 4 active REVIEW REVIEW 0.8
#> 8 5 Inactive INACTIVE INACTIVE 0.9
#> 9 5 Inactive active active 0.76
#> 10 5 Inactive REVIEW REVIEW 0.8
#> # ℹ 33 more rows

apply_fixes() has two modes:

- confirm = FALSE: apply all fixes above min_confidence automatically (batch mode)
- confirm = TRUE: interactive review, one fix at a time

# Non-interactive: apply fixes with confidence >= 0.88
df_clean <- apply_fixes(
messy_employees,
enriched,
confirm = FALSE,
min_confidence = 0.88
)
cat("Status before:", paste(sort(unique(messy_employees$status)), collapse=", "), "\n")
#> Status before: ACTIVE, Active, INACTIVE, Inactive, active, actve, inactive
cat("Status after: ", paste(sort(unique(df_clean$status)), collapse=", "), "\n\n")
#> Status after: ACTIVE, INACTIVE, active
cat("Department before:",
paste(sort(unique(messy_employees$department)), collapse=", "), "\n")
#> Department before: Finanace, Finance, HR, Human Resources, I.T., IT, MARKETING, Marketing, finance, hr, marketing
cat("Department after: ",
paste(sort(unique(df_clean$department)), collapse=", "), "\n")
#> Department after:  Finance, HR, Human Resources, I.T., IT, MARKETING, finance

Before applying anything, preview what would change:
plan <- apply_fixes(messy_employees, enriched, dry_run = TRUE)
cat("Planned changes:\n")
#> Planned changes:
print(plan[, c("column","row_index","current_value","suggestion","issue_type")])
#> # A tibble: 103 × 5
#> column row_index current_value suggestion issue_type
#> <chr> <int> <chr> <chr> <chr>
#> 1 age 10 150 REVIEW outlier
#> 2 department 1 Finance REVIEW format
#> 3 department 2 finance Finance case
#> 4 department 2 finance REVIEW format
#> 5 department 3 Finance REVIEW format
#> 6 department 4 HR it typo
#> 7 department 4 HR REVIEW format
#> 8 department 5 hr HR case
#> 9 department 5 hr it typo
#> 10 department 5 hr REVIEW format
#> # ℹ 93 more rows

offline_detect() runs without any LLM using three statistical methods:
# Works completely offline
offline_issues <- offline_detect(
messy_survey,
issue_types = c("case","typo","format","outlier"),
max_edit_distance = 2L
)
cat("Survey issues found:", nrow(offline_issues), "\n\n")
#> Survey issues found: 57
offline_issues[, c("column","value","issue_type","suggestion","confidence")]
#> # A tibble: 57 × 5
#> column value issue_type suggestion confidence
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 age_group 18-25 format REVIEW 0.8
#> 2 age_group 18-25 format REVIEW 0.8
#> 3 age_group 18-25 format REVIEW 0.8
#> 4 age_group 26-35 typo 36-45 0.76
#> 5 age_group 26-35 format REVIEW 0.8
#> 6 age_group 26-35 typo 36-45 0.76
#> 7 age_group 26-35 format REVIEW 0.8
#> 8 age_group 26-35 typo 36-45 0.76
#> 9 age_group 26-35 format REVIEW 0.8
#> 10 age_group 36-45 typo 46-55 0.76
#> # ℹ 47 more rows

llmclean_report() produces a structured audit log suitable for reproducible data quality documentation.
rpt <- llmclean_report(messy_employees, df_clean, issues)
#>
#> === llmclean Data Quality Report ===
#> Provider : offline (statistical)
#> Data : 20 rows x 8 columns
#> Issues found: 103
#> Fixes applied: 23 / skipped: 80
#>
#> Issues by column and type:
#> column issue_type n_detected n_applied
#> age outlier 1 0
#> department abbreviation 1 0
#> department case 4 4
#> department format 19 0
#> department typo 5 1
#> email format 2 0
#> email typo 3 0
#> hire_date format 20 0
#> hire_date typo 3 0
#> name case 1 1
#> salary outlier 1 0
#> status abbreviation 6 0
#> status case 15 15
#> status format 20 0
#> status typo 2 2
#>
#> Fixes applied:
#> # A tibble: 23 × 4
#> column row_index original corrected
#> <chr> <int> <chr> <chr>
#> 1 department 2 finance Finance
#> 2 department 5 hr HR
#> 3 department 10 Marketing MARKETING
#> 4 department 11 marketing MARKETING
#> 5 department 15 Finanace finance
#> 6 name 3 alice johnson Alice Johnson
#> 7 status 1 Active ACTIVE
#> 8 status 2 active ACTIVE
#> 9 status 4 active ACTIVE
#> 10 status 5 Inactive INACTIVE
#> # ℹ 13 more rows
#> =====================================

cat("Summary by column and type:\n")
#> Summary by column and type:
print(rpt$summary)
#> # A tibble: 15 × 4
#> column issue_type n_detected n_applied
#> <fct> <fct> <int> <int>
#> 1 age outlier 1 0
#> 2 department abbreviation 1 0
#> 3 department case 4 4
#> 4 department format 19 0
#> 5 department typo 5 1
#> 6 email format 2 0
#> 7 email typo 3 0
#> 8 hire_date format 20 0
#> 9 hire_date typo 3 0
#> 10 name case 1 1
#> 11 salary outlier 1 0
#> 12 status abbreviation 6 0
#> 13 status case 15 15
#> 14 status format 20 0
#> 15 status typo 2 2
cat("\nCell-level changes (first 8):\n")
#>
#> Cell-level changes (first 8):
print(head(rpt$changes, 8))
#> # A tibble: 8 × 4
#> column row_index original corrected
#> <chr> <int> <chr> <chr>
#> 1 department 2 finance Finance
#> 2 department 5 hr HR
#> 3 department 10 Marketing MARKETING
#> 4 department 11 marketing MARKETING
#> 5 department 15 Finanace finance
#> 6 name 3 alice johnson Alice Johnson
#> 7 status 1 Active ACTIVE
#> 8 status 2 active ACTIVE
cat("\nMetadata:\n")
#>
#> Metadata:
cat(" Provider :", rpt$metadata$provider, "\n")
#> Provider : offline
cat(" Model :", rpt$metadata$model, "\n")
#> Model : statistical
cat(" Detected :", rpt$metadata$n_total, "\n")
#> Detected : 103
cat(" Applied :", rpt$metadata$n_applied, "\n")
#> Applied : 23

library(llmclean)
# 1. Configure provider (use Groq free tier)
set_llm_provider("groq",
api_key = Sys.getenv("GROQ_API_KEY"),
model = "llama-3.1-8b-instant")
# 2. Load data
data(messy_employees)
# 3. Detect semantic issues
issues <- detect_issues(
messy_employees,
context = "Employee records. Status: active/inactive. Age: 18-70."
)
# 4. Enrich low-confidence suggestions
enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 2L)
# 5. Apply fixes non-interactively
df_clean <- apply_fixes(messy_employees, enriched,
confirm = FALSE, min_confidence = 0.80)
# 6. Generate audit report
llmclean_report(messy_employees, df_clean, issues)

Never hardcode API keys in scripts. Store them in
.Renviron:
# ~/.Renviron
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...

Then retrieve with Sys.getenv("OPENAI_API_KEY"). The
usethis package provides
usethis::edit_r_environ() to open this file.
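To confirm which keys the current R session can actually see, a small diagnostic sketch (not a package function) is enough; it reports only whether each key is set, never its value:

```r
# Report which provider keys are set in this session (values are not printed).
keys <- c("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GROQ_API_KEY", "GOOGLE_API_KEY")
data.frame(key = keys, is_set = nzchar(Sys.getenv(keys)))
```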
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 26200)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.2.0 llmclean_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] rstudioapi_0.18.0 knitr_1.51 magrittr_2.0.3 tidyselect_1.2.1
#> [5] R6_2.6.1 rlang_1.1.7 fastmap_1.2.0 tools_4.2.1
#> [9] xfun_0.57 utf8_1.2.6 cli_3.6.5 withr_3.0.2
#> [13] jquerylib_0.1.4 htmltools_0.5.9 yaml_2.3.12 digest_0.6.39
#> [17] tibble_3.3.1 lifecycle_1.0.5 sass_0.4.10 vctrs_0.7.2
#> [21] glue_1.7.0 cachem_1.1.0 evaluate_1.0.5 rmarkdown_2.31
#> [25] compiler_4.2.1 bslib_0.10.0 pillar_1.11.1 generics_0.1.4
#> [29] otel_0.2.0 jsonlite_2.0.0 pkgconfig_2.0.3

Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 313–324. https://doi.org/10.1145/872757.872796
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Müller, H. and Freytag, J.C. (2003). Problems, methods, and challenges in comprehensive data cleansing. Technical Report HUB-IB-164, Humboldt University Berlin.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. ISBN: 978-0-201-07616-5.
van der Loo, M.P.J. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons. https://doi.org/10.1002/9781118897126