Getting started with dqcheckr

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

dqcheckr automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time.

How it works

Your data file                  Your YAML config
      │                                │
      ▼                                ▼
  read_dataset()            load_config()
  (all columns as           (global defaults +
   character, trimmed)       per-dataset rules)
      │                                │
      └──────────────┬─────────────────┘
                     ▼
          run_qc_checks()        ← single-snapshot checks (QC / SC series)
          run_comparison_checks()← vs previous delivery   (CP series)
          run_custom_checks()    ← your own rules
                     │
                     ▼
          write_snapshot()       ← SQLite: run summary + per-column stats
          render_report()        ← self-contained HTML
                     │
                     ▼
          run_dq_check()         ← the one function you call

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "path/to/configs")

This prints a one-line console summary, writes an HTML report, and returns list(status, report_path, snapshot_id) invisibly.

Installation

dqcheckr is a first-party package — install it directly from source:

devtools::install("path/to/dqr/dqcheckr")
library(dqcheckr)

All dependencies are on CRAN: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, yaml, dplyr, tidyr, rlang.

Configuration

Two YAML files control every run.

Global config — `dqcheckr.yml`

Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.

snapshot_db:       "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Proportion of non-empty values that must parse as numeric for a column to
  # be classified as numeric (affects type inference across many checks)
  type_inference_threshold: 0.90

  # FAIL if any column's missing-value rate exceeds this (0-1)
  max_missing_rate: 0.05

  # FAIL if non-numeric values in a numeric column exceed this rate
  max_non_numeric_rate: 0.01

  # FAIL if row count drops below this (0 = disabled)
  min_row_count: 0

  # Version-comparison thresholds (used when a previous file is available)
  max_row_count_change_pct:       0.10
  max_numeric_mean_shift_pct:     0.20
  max_missing_rate_change_pp:     2.0    # percentage points
  max_non_numeric_rate_change_pp: 1.0

  flag_new_columns:         true   # set false to suppress new-column warnings in CP-02
  flag_dropped_columns:     true   # set false to suppress dropped-column warnings in CP-02
  flag_type_changes:        true   # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true   # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — `<dataset_name>.yml`

One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.

dataset_name: "customer_accounts"   # for human reference only — the functional
                                     # identifier is the YAML filename, which must
                                     # match the dataset_name argument to run_dq_check()

# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"

# Option B: explicit paths (set current_file only → single-file mode,
#           no version comparison)
# current_file:  "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

format:    csv
encoding:  "UTF-8"
delimiter: ","

# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
  - id

# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
  max_missing_rate:        0.02
  max_row_count_change_pct: 0.05
  min_row_count:           1000

# --- Per-column rules (optional) ---
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]   # QC-09
    pattern: "^[A-Z]{2}$"                       # QC-13
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0                                 # QC-10
    max_value: 1000000

# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"

# --- Output paths (optional, override global) ---
# snapshot_db:       "data/snapshots.sqlite"
# report_output_dir: "reports/"

What is required vs optional

The only fields the package requires to run are the file location (folder or current_file) and the format. Every other config key is optional — the corresponding checks are simply skipped when the key is absent.

Config key	Skipped / default when absent
`key_columns`	QC-12 (key uniqueness) skipped
`expected_columns`	SC-01 and SC-02 (schema contract) skipped
`column_rules.allowed_values`	QC-09 (allowed values) skipped
`column_rules.min_value` / `max_value`	QC-10 (numeric bounds) skipped
`column_rules.pattern`	QC-13 (pattern / regex) skipped
`custom_checks_file`	All custom checks skipped
`previous_file` / second file in folder	All CP-01–CP-08 (version comparison) skipped
`min_row_count` set to 0	QC-14 (minimum row count) skipped
`type_inference_threshold`	Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07)
`flag_new_columns`	Defaults to `true`; set to `false` to suppress new-column warnings in CP-02 (still tracked in SQLite)
`flag_dropped_columns`	Defaults to `true`; set to `false` to suppress dropped-column warnings in CP-02 (still tracked in SQLite)
`flag_type_changes`	Defaults to `true`; set to `false` to suppress type-change warnings in CP-02
`flag_column_order_change`	Defaults to `true`; set to `false` to skip CP-08 entirely

A minimal dataset config that runs basic checks with no column-level rules:

dataset_name: "my_dataset"           # for reference; must match the YAML filename
current_file: "data/incoming/my_dataset.csv"
format: csv

Fixed-width files

For fixed-width files, set format: fwf and supply column widths and names. If the file has a header row, set fwf_skip: 1 to skip it.

format: fwf
fwf_skip: 1
fwf_widths:    [10, 20, 3, 10, 12, 11]
fwf_col_names: [id, name, country_code, account_status, account_balance, created_date]

The quality checks

Each check returns one or more dq_result objects with four key fields: check_id, status (PASS / WARN / FAIL / INFO), observed, and message. The overall run status is the worst single status across all checks.

Single-snapshot checks (QC series)

These run on the current file alone.

ID	Name	Triggers on	Severity
QC-01	Missing rate	Column missing rate > `max_missing_rate`	FAIL
QC-02	Empty column	Column is 100% empty	FAIL
QC-03	Duplicate rows	Any fully-identical rows	WARN
QC-04	Row count	Always	INFO
QC-05	Column count	Always	INFO
QC-06	Inferred type	Always (date / numeric / character / unknown)	INFO
QC-07	Numeric stats	Numeric columns (min, max, mean, SD)	INFO
QC-08	Distinct count	Character columns	INFO
QC-09	Allowed values	Values outside `allowed_values` list	FAIL
QC-10	Numeric bounds	Values outside `min_value` / `max_value`	FAIL
QC-11	Non-numeric values	Non-numeric values in a numeric column	WARN / FAIL
QC-12	Key uniqueness	Duplicate values in a `key_columns` column	FAIL
QC-13	Pattern / regex	Values not matching `pattern`	FAIL
QC-14	Minimum row count	Row count below `min_row_count`	FAIL

Schema contract checks (SC series)

Only run when expected_columns is set.

ID	Name	Triggers on	Severity
SC-01	Unexpected column	Column in file not in `expected_columns`	FAIL
SC-02	Missing expected column	Column in `expected_columns` absent from file	FAIL

Version comparison checks (CP series)

Only run when a previous file is available. Skipped automatically in single-file mode.

ID	Name	Triggers on	Severity
CP-01	Row count change	Change > `max_row_count_change_pct`	WARN
CP-02	Schema diff	Columns added, dropped, or type-changed	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp	WARN
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct`	WARN
CP-05	New distinct values	Values in current not seen in previous	INFO
CP-06	Dropped distinct values	Values in previous absent from current	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp	WARN
CP-08	Column order change	Column order differs from previous	WARN (CSV) / FAIL (FWF)

Type inference

infer_col_type() classifies each column after whitespace trimming. Rules are applied in priority order:

date — all non-empty values parse as a known date format (%Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y)
numeric — ≥ type_inference_threshold of non-empty values coerce to numeric
character — everything else
unknown — all values are empty or NA

The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via type_inference_threshold in default_rules or rule_overrides:

# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
  type_inference_threshold: 0.95

# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
  type_inference_threshold: 0.80

Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.

infer_col_type(c("2024-01-15", "2024-06-30"))  # "date"
infer_col_type(c("100", "200", "N/A"))          # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17)))  # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA))                   # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric" (85%)

Running a check

result <- run_dq_check(
  dataset_name = "customer_accounts",
  config_dir   = "path/to/configs",   # contains dqcheckr.yml + customer_accounts.yml
  open_report  = TRUE                 # open HTML in browser when interactive
)

# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html

result$status       # "PASS", "WARN", or "FAIL"
result$report_path  # full path to the HTML file
result$snapshot_id  # integer row ID in snapshots table

The HTML report contains:

Header — dataset name, file name, run time, overall status badge
File summary — row count, column count, file size, format, encoding
Quality checks table — every check result, failures first
Custom checks table — if configured
Version comparison — what changed since the previous delivery
Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
Appendix — full column-level statistics table

Calling individual checks

You can call any check function directly, without going through the full pipeline. Every check takes (df, config) and returns a list of dq_result objects.

library(dqcheckr)

# Build a minimal config programmatically
cfg <- list(
  rules = list(
    max_missing_rate     = 0.05,
    max_non_numeric_rate = 0.01,
    min_row_count        = 0
  ),
  column_rules     = list(
    status = list(allowed_values = c("ACTIVE", "CLOSED"))
  ),
  key_columns      = "id",
  expected_columns = NULL
)

df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)

# Run a single check
results <- check_missing_rate(df, cfg)

# Inspect
results[[1]]$status    # "PASS" or "FAIL"
results[[1]]$observed  # e.g. "3.2% missing (4 of 125)"

# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)

# Overall status
overall_status(all_results)  # "FAIL" > "WARN" > "PASS" > "INFO"

Custom checks

Create a plain .R file that defines custom_checks(df). It receives the trimmed data frame and must return a list of dq_result objects.

# File: custom/customer_accounts_checks.R

custom_checks <- function(df) {
  results <- list()

  # Rule: ACTIVE accounts must not have a zero balance
  active_zero <- df[df$account_status == "ACTIVE" &
                    !is.na(df$account_balance) &
                    df$account_balance == "0", ]
  n <- nrow(active_zero)
  results <- c(results, list(dq_result(
    check_id   = "CUST-01",
    check_name = "No zero-balance active accounts",
    column     = "account_balance",   # enables per-column storage in SQLite
    status     = if (n > 0) "FAIL" else "PASS",
    observed   = sprintf("%d ACTIVE account(s) with balance 0", n),
    message    = if (n > 0)
      sprintf("%d ACTIVE account(s) have a zero balance.", n)
    else
      "No ACTIVE accounts have a zero balance."
  )))

  results
}

Point to it in the dataset config:

custom_checks_file: "custom/customer_accounts_checks.R"

The file is sourced in an isolated environment (not the global session). dq_result() is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message.

Tip: always set column = when your check is scoped to a specific column. Results with column = NA (the default) appear in the HTML report and contribute to the overall status, but are not written to the column_snapshots SQLite table and therefore do not appear in per-column trend queries.

The snapshot database

Every run writes two rows to a shared SQLite database: one row in snapshots (run-level summary) and one row per column-stat in column_snapshots.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/snapshots.sqlite")

# Recent runs for one dataset
dbGetQuery(con,
  "SELECT id, file_name, overall_status, check_fail_count, run_timestamp
   FROM snapshots
   WHERE dataset_name = 'customer_accounts'
   ORDER BY id DESC
   LIMIT 10")

# Column-level stats for the most recent run
dbGetQuery(con,
  "SELECT column_name, dq_check, value, threshold
   FROM column_snapshots
   WHERE snapshot_id = (
     SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
   )
   ORDER BY column_name, dq_check")

dbDisconnect(con)

Because all datasets share one database, you can query across them:

dbGetQuery(con,
  "SELECT dataset_name, COUNT(*) AS runs,
          SUM(check_fail_count) AS total_failures
   FROM snapshots
   GROUP BY dataset_name")

The output/ directory (and database file) are created automatically on the first run if they do not exist.

Worked example — Star Wars dataset

The following uses dplyr::starwars exported to CSV and fixed-width files.

# Config directory contains:
#   dqcheckr.yml         — global thresholds
#   starwars_csv.yml     — CSV dataset config
#   starwars_fwf.yml     — FWF dataset config

# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)

The Star Wars config sets max_missing_rate: 0.60 because vehicles (87%) and starships (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them.

Expected console output:

[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html

Error handling

Situation	Behaviour
Config file not found	Stop immediately; no output written
Data file not found	Stop immediately; no output written
Only one file available	Single-file mode; CP checks skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written

The package never modifies, moves, or deletes source files.

Design principles

Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.

Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.

Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.

Non-destructive. The package reads files and writes reports and database records. Source files are never touched.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Getting started with dqcheckr

How it works

Installation

Configuration

Global config — dqcheckr.yml

Per-dataset config — <dataset_name>.yml

What is required vs optional

Fixed-width files

The quality checks

Single-snapshot checks (QC series)

Schema contract checks (SC series)

Version comparison checks (CP series)

Type inference

Running a check

Calling individual checks

Custom checks

The snapshot database

Worked example — Star Wars dataset

Error handling

Design principles

Global config — `dqcheckr.yml`

Per-dataset config — `<dataset_name>.yml`