The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
dqcheckr automates the verification of recurring
external dataset deliveries. For each new file arrival it runs a battery
of quality checks, compares the file to the previous delivery, writes a
self-contained HTML report, and records summary statistics in a local
SQLite database so that quality trends can be tracked over time.
Your data file Your YAML config
│ │
▼ ▼
read_dataset() load_config()
(all columns as (global defaults +
character, trimmed) per-dataset rules)
│ │
└──────────────┬─────────────────┘
▼
run_qc_checks() ← single-snapshot checks (QC / SC series)
run_comparison_checks()← vs previous delivery (CP series)
run_custom_checks() ← your own rules
│
▼
write_snapshot() ← SQLite: run summary + per-column stats
render_report() ← self-contained HTML
│
▼
run_dq_check() ← the one function you call
A data officer runs a single command for each arriving dataset:
This prints a one-line console summary, writes an HTML report, and
returns list(status, report_path, snapshot_id)
invisibly.
dqcheckr is a first-party package — install it directly
from source:
All dependencies are on CRAN: readr, DBI,
RSQLite, rmarkdown, knitr,
kableExtra, ggplot2, gridExtra,
yaml, dplyr, tidyr,
rlang.
Two YAML files control every run.
dqcheckr.ymlPlaced in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.
snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"
default_rules:
# Proportion of non-empty values that must parse as numeric for a column to
# be classified as numeric (affects type inference across many checks)
type_inference_threshold: 0.90
# FAIL if any column's missing-value rate exceeds this (0-1)
max_missing_rate: 0.05
# FAIL if non-numeric values in a numeric column exceed this rate
max_non_numeric_rate: 0.01
# FAIL if row count drops below this (0 = disabled)
min_row_count: 0
# Version-comparison thresholds (used when a previous file is available)
max_row_count_change_pct: 0.10
max_numeric_mean_shift_pct: 0.20
max_missing_rate_change_pp: 2.0 # percentage points
max_non_numeric_rate_change_pp: 1.0
flag_new_columns: true # set false to suppress new-column warnings in CP-02
flag_dropped_columns: true # set false to suppress dropped-column warnings in CP-02
flag_type_changes: true # set false to suppress type-change warnings in CP-02
flag_column_order_change: true # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)<dataset_name>.ymlOne file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.
dataset_name: "customer_accounts" # for human reference only — the functional
# identifier is the YAML filename, which must
# match the dataset_name argument to run_dq_check()
# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"
# Option B: explicit paths (set current_file only → single-file mode,
# no version comparison)
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"
format: csv
encoding: "UTF-8"
delimiter: ","
# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
- id
- name
- country_code
- account_status
- account_balance
- created_date
# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
- id
# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
max_missing_rate: 0.02
max_row_count_change_pct: 0.05
min_row_count: 1000
# --- Per-column rules (optional) ---
column_rules:
country_code:
allowed_values: ["GB", "US", "DE", "FR"] # QC-09
pattern: "^[A-Z]{2}$" # QC-13
account_status:
allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
account_balance:
min_value: 0 # QC-10
max_value: 1000000
# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"
# --- Output paths (optional, override global) ---
# snapshot_db: "data/snapshots.sqlite"
# report_output_dir: "reports/"The only fields the package requires to run are the file location
(folder or current_file) and the
format. Every other config key is optional — the
corresponding checks are simply skipped when the key is absent.
| Config key | Skipped / default when absent |
|---|---|
key_columns |
QC-12 (key uniqueness) skipped |
expected_columns |
SC-01 and SC-02 (schema contract) skipped |
column_rules.allowed_values |
QC-09 (allowed values) skipped |
column_rules.min_value / max_value |
QC-10 (numeric bounds) skipped |
column_rules.pattern |
QC-13 (pattern / regex) skipped |
custom_checks_file |
All custom checks skipped |
previous_file / second file in folder |
All CP-01–CP-08 (version comparison) skipped |
min_row_count set to 0 |
QC-14 (minimum row count) skipped |
type_inference_threshold |
Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07) |
flag_new_columns |
Defaults to true; set to false to suppress
new-column warnings in CP-02 (still tracked in SQLite) |
flag_dropped_columns |
Defaults to true; set to false to suppress
dropped-column warnings in CP-02 (still tracked in SQLite) |
flag_type_changes |
Defaults to true; set to false to suppress
type-change warnings in CP-02 |
flag_column_order_change |
Defaults to true; set to false to skip
CP-08 entirely |
A minimal dataset config that runs basic checks with no column-level rules:
Each check returns one or more dq_result objects with
four key fields: check_id, status
(PASS / WARN / FAIL /
INFO), observed, and message. The
overall run status is the worst single status across all checks.
These run on the current file alone.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| QC-01 | Missing rate | Column missing rate > max_missing_rate |
FAIL |
| QC-02 | Empty column | Column is 100% empty | FAIL |
| QC-03 | Duplicate rows | Any fully-identical rows | WARN |
| QC-04 | Row count | Always | INFO |
| QC-05 | Column count | Always | INFO |
| QC-06 | Inferred type | Always (date / numeric / character / unknown) | INFO |
| QC-07 | Numeric stats | Numeric columns (min, max, mean, SD) | INFO |
| QC-08 | Distinct count | Character columns | INFO |
| QC-09 | Allowed values | Values outside allowed_values list |
FAIL |
| QC-10 | Numeric bounds | Values outside min_value / max_value |
FAIL |
| QC-11 | Non-numeric values | Non-numeric values in a numeric column | WARN / FAIL |
| QC-12 | Key uniqueness | Duplicate values in a key_columns column |
FAIL |
| QC-13 | Pattern / regex | Values not matching pattern |
FAIL |
| QC-14 | Minimum row count | Row count below min_row_count |
FAIL |
Only run when expected_columns is set.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| SC-01 | Unexpected column | Column in file not in expected_columns |
FAIL |
| SC-02 | Missing expected column | Column in expected_columns absent from file |
FAIL |
Only run when a previous file is available. Skipped automatically in single-file mode.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| CP-01 | Row count change | Change > max_row_count_change_pct |
WARN |
| CP-02 | Schema diff | Columns added, dropped, or type-changed | WARN |
| CP-03 | Missing rate change | Increase > max_missing_rate_change_pp pp |
WARN |
| CP-04 | Numeric mean shift | Shift > max_numeric_mean_shift_pct |
WARN |
| CP-05 | New distinct values | Values in current not seen in previous | INFO |
| CP-06 | Dropped distinct values | Values in previous absent from current | INFO |
| CP-07 | Non-numeric rate change | Increase > max_non_numeric_rate_change_pp pp |
WARN |
| CP-08 | Column order change | Column order differs from previous | WARN (CSV) / FAIL (FWF) |
infer_col_type() classifies each column after whitespace
trimming. Rules are applied in priority order:
%Y-%m-%d, %d/%m/%Y,
%m/%d/%Y, %Y%m%d, %d-%m-%Y)type_inference_threshold
of non-empty values coerce to numericNAThe threshold defaults to 90%, meaning a column with up to 10% bad
values is still classified as numeric, making those bad values visible
to QC-11 (non-numeric contamination). It is configurable per dataset via
type_inference_threshold in default_rules or
rule_overrides:
# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
type_inference_threshold: 0.95
# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
type_inference_threshold: 0.80Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.
infer_col_type(c("2024-01-15", "2024-06-30")) # "date"
infer_col_type(c("100", "200", "N/A")) # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17))) # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA)) # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80) # "numeric" (85%)result <- run_dq_check(
dataset_name = "customer_accounts",
config_dir = "path/to/configs", # contains dqcheckr.yml + customer_accounts.yml
open_report = TRUE # open HTML in browser when interactive
)
# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html
result$status # "PASS", "WARN", or "FAIL"
result$report_path # full path to the HTML file
result$snapshot_id # integer row ID in snapshots tableThe HTML report contains:
You can call any check function directly, without going through the
full pipeline. Every check takes (df, config) and returns a
list of dq_result objects.
library(dqcheckr)
# Build a minimal config programmatically
cfg <- list(
rules = list(
max_missing_rate = 0.05,
max_non_numeric_rate = 0.01,
min_row_count = 0
),
column_rules = list(
status = list(allowed_values = c("ACTIVE", "CLOSED"))
),
key_columns = "id",
expected_columns = NULL
)
df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)
# Run a single check
results <- check_missing_rate(df, cfg)
# Inspect
results[[1]]$status # "PASS" or "FAIL"
results[[1]]$observed # e.g. "3.2% missing (4 of 125)"
# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)
# Overall status
overall_status(all_results) # "FAIL" > "WARN" > "PASS" > "INFO"Create a plain .R file that defines
custom_checks(df). It receives the trimmed data frame and
must return a list of dq_result objects.
# File: custom/customer_accounts_checks.R
custom_checks <- function(df) {
results <- list()
# Rule: ACTIVE accounts must not have a zero balance
active_zero <- df[df$account_status == "ACTIVE" &
!is.na(df$account_balance) &
df$account_balance == "0", ]
n <- nrow(active_zero)
results <- c(results, list(dq_result(
check_id = "CUST-01",
check_name = "No zero-balance active accounts",
column = "account_balance", # enables per-column storage in SQLite
status = if (n > 0) "FAIL" else "PASS",
observed = sprintf("%d ACTIVE account(s) with balance 0", n),
message = if (n > 0)
sprintf("%d ACTIVE account(s) have a zero balance.", n)
else
"No ACTIVE accounts have a zero balance."
)))
results
}Point to it in the dataset config:
The file is sourced in an isolated environment (not the global
session). dq_result() is available inside it. Any error —
missing file, undefined function, runtime failure — stops the run with a
clear message.
Tip: always set column = when your
check is scoped to a specific column. Results with
column = NA (the default) appear in the HTML report and
contribute to the overall status, but are not written to the
column_snapshots SQLite table and therefore do not appear
in per-column trend queries.
Every run writes two rows to a shared SQLite database: one row in
snapshots (run-level summary) and one row per column-stat
in column_snapshots.
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/snapshots.sqlite")
# Recent runs for one dataset
dbGetQuery(con,
"SELECT id, file_name, overall_status, check_fail_count, run_timestamp
FROM snapshots
WHERE dataset_name = 'customer_accounts'
ORDER BY id DESC
LIMIT 10")
# Column-level stats for the most recent run
dbGetQuery(con,
"SELECT column_name, dq_check, value, threshold
FROM column_snapshots
WHERE snapshot_id = (
SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
)
ORDER BY column_name, dq_check")
dbDisconnect(con)Because all datasets share one database, you can query across them:
dbGetQuery(con,
"SELECT dataset_name, COUNT(*) AS runs,
SUM(check_fail_count) AS total_failures
FROM snapshots
GROUP BY dataset_name")The output/ directory (and database file) are created
automatically on the first run if they do not exist.
The following uses dplyr::starwars exported to CSV and
fixed-width files.
# Config directory contains:
# dqcheckr.yml — global thresholds
# starwars_csv.yml — CSV dataset config
# starwars_fwf.yml — FWF dataset config
# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)The Star Wars config sets max_missing_rate: 0.60 because
vehicles (87%) and starships (77%) are
naturally sparse — most characters have none. Both columns FAIL, which
is the correct finding: they are genuinely incomplete and a consumer
should be aware before using them.
Expected console output:
[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html
| Situation | Behaviour |
|---|---|
| Config file not found | Stop immediately; no output written |
| Data file not found | Stop immediately; no output written |
| Only one file available | Single-file mode; CP checks skipped |
| File cannot be parsed | Stop with filename and parse error |
| SC-01 / SC-02 violations | FAIL recorded; run continues; report written |
| Custom checks file missing | Stop with message |
custom_checks() not defined |
Stop with message |
custom_checks() runtime error |
Stop with R error detail |
| SQLite write fails | Warning emitted; HTML report still written |
The package never modifies, moves, or deletes source files.
Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.
Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.
Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.
Non-destructive. The package reads files and writes reports and database records. Source files are never touched.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.