
LLM-Assisted Data Cleaning with llmclean

Sadikul Islam

2026-04-22

1 Introduction

Data frames collected from surveys, administrative records, or manual entry almost always contain semantic inconsistencies that rule-based tools cannot catch. The llmclean package addresses this by sending a compact representation of your data to a large language model (LLM) and requesting structured, human-readable suggestions for each issue found.

The key distinction from existing tools:

Tool                    What it catches
janitor                 Column name formatting
validate / pointblank   Rule-based type/range checks
llmclean                Semantic issues: typos, abbreviations, case, malformed formats, cross-field contradictions

No existing CRAN package provides the detect → explain → suggest → apply workflow specialised for semantic data frame inconsistencies.

1.1 Supported Providers

openai     →  GPT-4o, GPT-4o-mini     (API key required)
anthropic  →  Claude Haiku / Sonnet   (API key required)
google     →  Gemini 2.0 Flash        (free tier available)
groq       →  LLaMA 3.1, Mixtral      (free tier available)
ollama     →  Any local model         (no key, fully offline)
offline    →  Statistical fallback    (no API, no internet)

2 Quick Start

library(llmclean)
library(dplyr)

2.1 Step 1 — Configure the provider

For this vignette we use offline mode, which requires no API key:

set_llm_provider("offline")

With a real LLM provider (not run here):

# Free Groq tier — fastest inference
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# OpenAI
set_llm_provider("openai",
                 api_key = Sys.getenv("OPENAI_API_KEY"),
                 model   = "gpt-4o-mini")

# Anthropic Claude
set_llm_provider("anthropic",
                 api_key = Sys.getenv("ANTHROPIC_API_KEY"),
                 model   = "claude-haiku-4-5-20251001")

# Local Ollama (no key needed, model must be installed)
set_llm_provider("ollama", model = "llama3")

2.2 Step 2 — Inspect the built-in messy datasets

data(messy_employees)
data(messy_survey)

cat("messy_employees:", nrow(messy_employees), "rows x",
    ncol(messy_employees), "cols\n\n")
#> messy_employees: 20 rows x 8 cols

# Peek at known issues
cat("Status variants:\n"); print(table(messy_employees$status))
#> Status variants:
#> 
#>   ACTIVE   Active INACTIVE Inactive   active    actve inactive 
#>        2        5        1        1        5        2        4
cat("\nDepartment variants:\n"); print(table(messy_employees$department))
#> 
#> Department variants:
#> 
#>        Finanace         Finance              HR Human Resources            I.T. 
#>               1               4               3               1               1 
#>              IT       MARKETING       Marketing         finance              hr 
#>               5               1               1               1               1 
#>       marketing 
#>               1
cat("\nAge outliers:", messy_employees$age[messy_employees$age < 0 |
                                             messy_employees$age > 100], "\n")
#> 
#> Age outliers: -5 150

The data contains at least six types of inconsistency across eight columns — typical of manually entered HR records.


3 Stage 1 — Detect Issues

detect_issues() calls the LLM (or offline fallback) and returns a tidy tibble with one row per detected problem.

issues <- detect_issues(messy_employees,
                         context = "HR employee records. Status values
                                    should be 'active' or 'inactive'.")
cat("Issues found:", nrow(issues), "\n\n")
#> Issues found: 103
print(issues[, c("column","row_index","value","issue_type",
                  "suggestion","confidence")])
#> # A tibble: 103 × 6
#>    column     row_index value   issue_type suggestion confidence
#>    <chr>          <int> <chr>   <chr>      <chr>           <dbl>
#>  1 age               10 150     outlier    REVIEW           0.75
#>  2 department         1 Finance format     REVIEW           0.8 
#>  3 department         2 finance case       Finance          0.9 
#>  4 department         2 finance format     REVIEW           0.8 
#>  5 department         3 Finance format     REVIEW           0.8 
#>  6 department         4 HR      typo       it               0.76
#>  7 department         4 HR      format     REVIEW           0.8 
#>  8 department         5 hr      case       HR               0.9 
#>  9 department         5 hr      typo       it               0.76
#> 10 department         5 hr      format     REVIEW           0.8 
#> # ℹ 93 more rows

3.1 Issue type breakdown

# Summary by type
as.data.frame(table(Type = issues$issue_type)) |>
  dplyr::arrange(dplyr::desc(Freq))
#>           Type Freq
#> 1       format   61
#> 2         case   20
#> 3         typo   13
#> 4 abbreviation    7
#> 5      outlier    2

3.2 Case inconsistencies

# Show all case inconsistencies found
issues[issues$issue_type == "case",
       c("column","row_index","value","suggestion","confidence")]
#> # A tibble: 20 × 5
#>    column     row_index value         suggestion    confidence
#>    <chr>          <int> <chr>         <chr>              <dbl>
#>  1 department         2 finance       Finance              0.9
#>  2 department         5 hr            HR                   0.9
#>  3 department        10 Marketing     MARKETING            0.9
#>  4 department        11 marketing     MARKETING            0.9
#>  5 name               3 alice johnson Alice Johnson        0.9
#>  6 status             1 Active        ACTIVE               0.9
#>  7 status             2 active        ACTIVE               0.9
#>  8 status             4 active        ACTIVE               0.9
#>  9 status             5 Inactive      INACTIVE             0.9
#> 10 status             6 inactive      INACTIVE             0.9
#> 11 status             8 Active        ACTIVE               0.9
#> 12 status            10 active        ACTIVE               0.9
#> 13 status            11 Active        ACTIVE               0.9
#> 14 status            12 inactive      INACTIVE             0.9
#> 15 status            14 active        ACTIVE               0.9
#> 16 status            15 Active        ACTIVE               0.9
#> 17 status            16 inactive      INACTIVE             0.9
#> 18 status            17 inactive      INACTIVE             0.9
#> 19 status            18 Active        ACTIVE               0.9
#> 20 status            20 active        ACTIVE               0.9

3.3 Typos and near-duplicates (Levenshtein)

The offline detector uses utils::adist() to compute Levenshtein edit distances between all pairs of unique values in each column. Values within 2 edits of a more frequent value are flagged as typos.

issues[issues$issue_type == "typo",
       c("column","row_index","value","suggestion","explanation")]
#> # A tibble: 13 × 5
#>    column     row_index value             suggestion        explanation         
#>    <chr>          <int> <chr>             <chr>             <chr>               
#>  1 department         4 HR                it                'HR' is 2 edit(s) f…
#>  2 department         5 hr                it                'hr' is 2 edit(s) f…
#>  3 department        15 Finanace          finance           'Finanace' is 1 edi…
#>  4 department        19 HR                it                'HR' is 2 edit(s) f…
#>  5 department        20 HR                it                'HR' is 2 edit(s) f…
#>  6 email              7 fiona@company.com nina@company.com  'fiona@company.com'…
#>  7 email             10 ian@company.com   sam@company.com   'ian@company.com' i…
#>  8 email             13 laura@company.com paula@company.com 'laura@company.com'…
#>  9 hire_date          8 2021-01-18        2021-06-14        '2021-01-18' is 2 e…
#> 10 hire_date         13 2011-07-15        2018-03-15        '2011-07-15' is 2 e…
#> 11 hire_date         20 2018-09-25        2018-03-15        '2018-09-25' is 2 e…
#> 12 status             9 actve             active            'actve' is 1 edit(s…
#> 13 status            19 actve             active            'actve' is 1 edit(s…
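The edit-distance rule described above can be sketched directly with utils::adist(); the sample values here are illustrative, not the package's internals:

```r
# Illustrative sketch of the offline typo rule: flag a value when it is
# within 2 edits of a more frequent value in the same column.
vals <- c("active", "active", "active", "actve", "inactive")
freq <- table(vals)
u    <- names(freq)
d    <- utils::adist(u)                 # pairwise Levenshtein distances
dimnames(d) <- list(u, u)
d["actve", "active"]                    # 1 edit from a more frequent value
#> [1] 1
```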

3.4 Malformed email addresses

issues[issues$issue_type == "format",
       c("column","row_index","value","suggestion")]
#> # A tibble: 61 × 4
#>    column     row_index value     suggestion
#>    <chr>          <int> <chr>     <chr>     
#>  1 department         1 Finance   REVIEW    
#>  2 department         2 finance   REVIEW    
#>  3 department         3 Finance   REVIEW    
#>  4 department         4 HR        REVIEW    
#>  5 department         5 hr        REVIEW    
#>  6 department         7 IT        REVIEW    
#>  7 department         8 I.T.      REVIEW    
#>  8 department         9 IT        REVIEW    
#>  9 department        10 Marketing REVIEW    
#> 10 department        11 marketing REVIEW    
#> # ℹ 51 more rows

3.5 Numeric outliers (Tukey outer fence)

issues[issues$issue_type == "outlier",
       c("column","row_index","value","explanation")]
#> # A tibble: 2 × 4
#>   column row_index value  explanation                                           
#>   <chr>      <int> <chr>  <chr>                                                 
#> 1 age           10 150    Value 150 is outside the Tukey outer fence [-7, 77] (…
#> 2 salary        10 999999 Value 999999 is outside the Tukey outer fence [2500, …
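The outer fence reported above spans Q1 - 3*IQR to Q3 + 3*IQR; a minimal sketch with toy data (not the package's implementation):

```r
# Tukey outer fence: flag values outside [Q1 - 3*IQR, Q3 + 3*IQR]
x     <- c(22, 25, 28, 31, 150)
q     <- stats::quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 = 25, Q3 = 31
iqr   <- q[2] - q[1]
fence <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)                 # [7, 49]
x[x < fence[1] | x > fence[2]]
#> [1] 150
```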

4 Stage 2 — Suggest Fixes

suggest_fixes() can enrich low-confidence suggestions by re-querying the LLM with surrounding row context. In offline mode it returns the issues unchanged.

enriched <- suggest_fixes(messy_employees, issues)
cat("Enriched columns:", paste(names(enriched), collapse = ", "), "\n")
#> Enriched columns: column, row_index, value, issue_type, explanation, suggestion, confidence, provider, model, alternatives, confidence_revised

# Show suggestions for status column
enriched[enriched$column == "status",
          c("row_index","value","suggestion","alternatives","confidence_revised")]
#> # A tibble: 43 × 5
#>    row_index value    suggestion alternatives confidence_revised
#>        <int> <chr>    <chr>      <chr>                     <dbl>
#>  1         1 Active   ACTIVE     ACTIVE                     0.9 
#>  2         1 Active   REVIEW     REVIEW                     0.8 
#>  3         2 active   ACTIVE     ACTIVE                     0.9 
#>  4         2 active   REVIEW     REVIEW                     0.8 
#>  5         3 ACTIVE   REVIEW     REVIEW                     0.8 
#>  6         4 active   ACTIVE     ACTIVE                     0.9 
#>  7         4 active   REVIEW     REVIEW                     0.8 
#>  8         5 Inactive INACTIVE   INACTIVE                   0.9 
#>  9         5 Inactive active     active                     0.76
#> 10         5 Inactive REVIEW     REVIEW                     0.8 
#> # ℹ 33 more rows

5 Stage 3 — Apply Fixes

apply_fixes() has two modes: non-interactive application with a confidence threshold (shown here) and a dry-run preview (Section 5.1).

# Non-interactive: apply fixes with confidence >= 0.88
df_clean <- apply_fixes(
  messy_employees,
  enriched,
  confirm        = FALSE,
  min_confidence = 0.88
)

cat("Status before:", paste(sort(unique(messy_employees$status)), collapse=", "), "\n")
#> Status before: ACTIVE, Active, INACTIVE, Inactive, active, actve, inactive
cat("Status after: ", paste(sort(unique(df_clean$status)), collapse=", "), "\n\n")
#> Status after:  ACTIVE, INACTIVE, active

cat("Department before:",
    paste(sort(unique(messy_employees$department)), collapse=", "), "\n")
#> Department before: Finanace, Finance, HR, Human Resources, I.T., IT, MARKETING, Marketing, finance, hr, marketing
cat("Department after: ",
    paste(sort(unique(df_clean$department)), collapse=", "), "\n")
#> Department after:  Finance, HR, Human Resources, I.T., IT, MARKETING, finance

5.1 Dry run mode

Before applying anything, preview what would change:

plan <- apply_fixes(messy_employees, enriched, dry_run = TRUE)
cat("Planned changes:\n")
#> Planned changes:
print(plan[, c("column","row_index","current_value","suggestion","issue_type")])
#> # A tibble: 103 × 5
#>    column     row_index current_value suggestion issue_type
#>    <chr>          <int> <chr>         <chr>      <chr>     
#>  1 age               10 150           REVIEW     outlier   
#>  2 department         1 Finance       REVIEW     format    
#>  3 department         2 finance       Finance    case      
#>  4 department         2 finance       REVIEW     format    
#>  5 department         3 Finance       REVIEW     format    
#>  6 department         4 HR            it         typo      
#>  7 department         4 HR            REVIEW     format    
#>  8 department         5 hr            HR         case      
#>  9 department         5 hr            it         typo      
#> 10 department         5 hr            REVIEW     format    
#> # ℹ 93 more rows

6 Stage 4 — Offline Detection (No API Key)

offline_detect() runs without any LLM using three statistical methods:

  1. Levenshtein distance (Chaudhuri et al., 2003) for typo detection
  2. Regex patterns for email, date, and phone format validation
  3. Tukey outer fence (Tukey, 1977) for numeric outlier detection

# Works completely offline
offline_issues <- offline_detect(
  messy_survey,
  issue_types      = c("case","typo","format","outlier"),
  max_edit_distance = 2L
)

cat("Survey issues found:", nrow(offline_issues), "\n\n")
#> Survey issues found: 57
offline_issues[, c("column","value","issue_type","suggestion","confidence")]
#> # A tibble: 57 × 5
#>    column    value issue_type suggestion confidence
#>    <chr>     <chr> <chr>      <chr>           <dbl>
#>  1 age_group 18-25 format     REVIEW           0.8 
#>  2 age_group 18-25 format     REVIEW           0.8 
#>  3 age_group 18-25 format     REVIEW           0.8 
#>  4 age_group 26-35 typo       36-45            0.76
#>  5 age_group 26-35 format     REVIEW           0.8 
#>  6 age_group 26-35 typo       36-45            0.76
#>  7 age_group 26-35 format     REVIEW           0.8 
#>  8 age_group 26-35 typo       36-45            0.76
#>  9 age_group 26-35 format     REVIEW           0.8 
#> 10 age_group 36-45 typo       46-55            0.76
#> # ℹ 47 more rows
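The regex-based format check (method 2 above) can be approximated as follows; the email pattern here is an illustrative assumption, not the exact expression the package uses:

```r
# Flag strings that do not look like a plausible email address
pattern <- "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"   # illustrative, deliberately loose
emails  <- c("ann@company.com", "bob[at]company.com", "carol@company")
emails[!grepl(pattern, emails)]
#> [1] "bob[at]company.com" "carol@company"
```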

7 Stage 5 — Summary Report

llmclean_report() produces a structured audit log suitable for reproducible data quality documentation.

rpt <- llmclean_report(messy_employees, df_clean, issues)
#> 
#> === llmclean Data Quality Report ===
#> Provider    : offline (statistical)
#> Data        : 20 rows x 8 columns
#> Issues found: 103
#> Fixes applied: 23 / skipped: 80
#> 
#> Issues by column and type:
#>      column   issue_type n_detected n_applied
#>         age      outlier          1         0
#>  department abbreviation          1         0
#>  department         case          4         4
#>  department       format         19         0
#>  department         typo          5         1
#>       email       format          2         0
#>       email         typo          3         0
#>   hire_date       format         20         0
#>   hire_date         typo          3         0
#>        name         case          1         1
#>      salary      outlier          1         0
#>      status abbreviation          6         0
#>      status         case         15        15
#>      status       format         20         0
#>      status         typo          2         2
#> 
#> Fixes applied:
#> # A tibble: 23 × 4
#>    column     row_index original      corrected    
#>    <chr>          <int> <chr>         <chr>        
#>  1 department         2 finance       Finance      
#>  2 department         5 hr            HR           
#>  3 department        10 Marketing     MARKETING    
#>  4 department        11 marketing     MARKETING    
#>  5 department        15 Finanace      finance      
#>  6 name               3 alice johnson Alice Johnson
#>  7 status             1 Active        ACTIVE       
#>  8 status             2 active        ACTIVE       
#>  9 status             4 active        ACTIVE       
#> 10 status             5 Inactive      INACTIVE     
#> # ℹ 13 more rows
#> =====================================
cat("Summary by column and type:\n")
#> Summary by column and type:
print(rpt$summary)
#> # A tibble: 15 × 4
#>    column     issue_type   n_detected n_applied
#>    <fct>      <fct>             <int>     <int>
#>  1 age        outlier               1         0
#>  2 department abbreviation          1         0
#>  3 department case                  4         4
#>  4 department format               19         0
#>  5 department typo                  5         1
#>  6 email      format                2         0
#>  7 email      typo                  3         0
#>  8 hire_date  format               20         0
#>  9 hire_date  typo                  3         0
#> 10 name       case                  1         1
#> 11 salary     outlier               1         0
#> 12 status     abbreviation          6         0
#> 13 status     case                 15        15
#> 14 status     format               20         0
#> 15 status     typo                  2         2

cat("\nCell-level changes (first 8):\n")
#> 
#> Cell-level changes (first 8):
print(head(rpt$changes, 8))
#> # A tibble: 8 × 4
#>   column     row_index original      corrected    
#>   <chr>          <int> <chr>         <chr>        
#> 1 department         2 finance       Finance      
#> 2 department         5 hr            HR           
#> 3 department        10 Marketing     MARKETING    
#> 4 department        11 marketing     MARKETING    
#> 5 department        15 Finanace      finance      
#> 6 name               3 alice johnson Alice Johnson
#> 7 status             1 Active        ACTIVE       
#> 8 status             2 active        ACTIVE

cat("\nMetadata:\n")
#> 
#> Metadata:
cat("  Provider  :", rpt$metadata$provider, "\n")
#>   Provider  : offline
cat("  Model     :", rpt$metadata$model, "\n")
#>   Model     : statistical
cat("  Detected  :", rpt$metadata$n_total, "\n")
#>   Detected  : 103
cat("  Applied   :", rpt$metadata$n_applied, "\n")
#>   Applied   : 23

8 Complete Workflow (One Pipeline)

library(llmclean)

# 1. Configure provider (use Groq free tier)
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# 2. Load data
data(messy_employees)

# 3. Detect semantic issues
issues <- detect_issues(
  messy_employees,
  context = "Employee records. Status: active/inactive. Age: 18-70."
)

# 4. Enrich low-confidence suggestions
enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 2L)

# 5. Apply fixes non-interactively
df_clean <- apply_fixes(messy_employees, enriched,
                         confirm = FALSE, min_confidence = 0.80)

# 6. Generate audit report
llmclean_report(messy_employees, df_clean, issues)

9 API Key Storage Best Practice

Never hardcode API keys in scripts. Store them in .Renviron:

# ~/.Renviron
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...

Then retrieve with Sys.getenv("OPENAI_API_KEY"). The usethis package provides usethis::edit_r_environ() to open this file.
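A small defensive pattern when retrieving keys at the top of a script (a sketch; the error message is our own, not part of llmclean):

```r
# Fail fast with a clear message if the key is missing from .Renviron
key <- Sys.getenv("GROQ_API_KEY")
if (!nzchar(key)) {
  stop("GROQ_API_KEY is not set; add it to ~/.Renviron and restart R.")
}
```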


10 Session Information

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 26200)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.2.0    llmclean_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.18.0 knitr_1.51        magrittr_2.0.3    tidyselect_1.2.1 
#>  [5] R6_2.6.1          rlang_1.1.7       fastmap_1.2.0     tools_4.2.1      
#>  [9] xfun_0.57         utf8_1.2.6        cli_3.6.5         withr_3.0.2      
#> [13] jquerylib_0.1.4   htmltools_0.5.9   yaml_2.3.12       digest_0.6.39    
#> [17] tibble_3.3.1      lifecycle_1.0.5   sass_0.4.10       vctrs_0.7.2      
#> [21] glue_1.7.0        cachem_1.1.0      evaluate_1.0.5    rmarkdown_2.31   
#> [25] compiler_4.2.1    bslib_0.10.0      pillar_1.11.1     generics_0.1.4   
#> [29] otel_0.2.0        jsonlite_2.0.0    pkgconfig_2.0.3

11 References

Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 313–324. https://doi.org/10.1145/872757.872796

de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Müller, H. and Freytag, J.C. (2003). Problems, methods, and challenges in comprehensive data cleansing. Technical Report HUB-IB-164, Humboldt University Berlin.

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. ISBN: 978-0-201-07616-5.

van der Loo, M.P.J. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons. https://doi.org/10.1002/9781118897126
