dqcheckr — Software Specification

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Mick Mioduszewski

2026-05-10

Part A — Purpose and Intent

Why this exists

The analytics organisation ingests external datasets from multiple suppliers on a recurring basis. Verifying that each new delivery is complete, structurally correct, and consistent with prior deliveries is currently slow and largely manual. Analysts spend time on repetitive visual inspection instead of higher-value work, and quality problems risk being missed or caught late.

What we are building

dqcheckr is a purpose-built R package that automates this verification process. For each dataset arrival it produces a single, plain-English HTML report that tells data management staff:

Whether the file is structurally sound (columns, types, row count as expected).
Whether data values are within acceptable bounds and free of obvious errors.
How this delivery compares to the previous one — what has changed, what has drifted.
How this delivery compares to a defined schema contract, if one exists.

Results are also stored in a local SQLite database so that quality trends can be tracked over time across many deliveries.

Design principles

Owned and evolvable. The tool is first-party R code, not a thin wrapper around a third-party API. Every check is directly readable and modifiable.
Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.
Fail loudly. Errors in configuration, custom checks, or parsing stop the run with a clear message. Nothing is silently skipped.
Non-destructive. The tool reads files and writes reports and database records. It never modifies, moves, or deletes source files.

Part B — Business-Level Description

B.1 How it is used

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "config")

The tool finds the two most recent files in the dataset’s folder, checks the newer one, compares it to the older one, writes a timestamped HTML report to a reports folder, and prints a one-line summary to the console. The report opens automatically in the browser.

Files can also be identified explicitly by naming a current and a previous file, rather than relying on folder modification times.

Scale: the tool is designed for up to 10 dataset arrivals per day across all dataset types.

B.2 What is checked

Data completeness

Any column with too many missing values is flagged.
Any column that is entirely empty is an error.
If the file has fewer rows than the configured minimum, it is an error.

Data validity

Columns expected to contain only specific values (e.g. country codes, status flags) are checked against a permitted list. Unexpected values are an error.
Numeric columns expected to fall within a range are checked. Out-of-range values are an error.
Columns expected to match a format (e.g. a postcode pattern) are checked by regular expression. Violations are an error.

Data integrity

Columns designated as keys (e.g. a unique ID column) are checked for duplicates. Duplicates are an error.
Rows that are entirely identical to another row in the same file are flagged as a warning.

Data consistency

The inferred data type of each column is recorded. A type change between deliveries is flagged.
Non-numeric values in a column that should be numeric are flagged (warning if any present; error if too many).

Schema contract (when an expected column list is configured)

Columns present in the file but not in the expected list are an error.
Columns in the expected list but absent from the file are an error.

Custom checks (when a custom checks file is configured for a dataset)

Organisation-specific rules written in R are run against the file. Results appear alongside the standard checks.

B.3 What is compared to the previous delivery

Row count: a large change (increase or decrease) is flagged.
Schema: new columns, dropped columns, and type changes are flagged.
Missing rates: columns where the proportion of missing values has increased significantly are flagged.
Numeric distributions: columns where the average value has shifted significantly are flagged.
Non-numeric contamination: columns where the proportion of bad (non-numeric) values has increased are flagged.
Distinct values: new values that have appeared and values that have disappeared in categorical columns are listed.
Column order: a change in the order of columns is a warning for CSV files and an error for fixed-width files (where column order determines parsing).

B.4 What is produced

HTML report — one file per run, self-contained (no internet required to view). Contains:

A header showing dataset name, file name, run time, and an overall status badge (green = all clear, amber = warnings, red = failures).
A file summary (row count, column count, file size, format, encoding).
A quality checks table listing every check, its status, what was observed, the threshold, and a plain-English explanation. Failures and warnings appear first.
A custom checks table (if configured).
A version comparison section showing what changed since the previous delivery, with a plain-English narrative summary at the top.
A historical trend section showing row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded).
An appendix with a full column-level statistics table.

SQLite database — a single database shared across all datasets. Records summary statistics for every run, enabling long-term trend queries without reading the source files again.

Part C — Technical Specification

C.1 Package structure

dqcheckr/
├── R/
│   ├── run_check.R          # main entry point: run_dq_check()
│   ├── ingest.R             # read CSV / FWF; trim whitespace
│   ├── snapshot.R           # SQLite read/write
│   ├── checks_generic.R     # QC-01..14 and SC-01..02
│   ├── checks_custom.R      # source and call custom_checks(df)
│   ├── compare.R            # CP-01..08
│   ├── report.R             # rmarkdown::render() wrapper
│   ├── utils.R              # dq_result(), load_config(), infer_col_type(), helpers
│   └── dqcheckr-package.R   # package-level documentation
├── inst/
│   ├── templates/report.Rmd
│   ├── config/
│   │   ├── dqcheckr.yml
│   │   └── example_dataset.yml
│   └── demonstrations/
│       ├── demo.R           # named-file mode demo (CSV + FWF)
│       ├── demo2.R          # folder-scan mode demo with version comparison
│       ├── README.md
│       ├── config/          # global + starwars_csv + starwars_fwf configs
│       ├── config2/         # global + starwars_folder configs
│       ├── custom2/         # starwars_custom.R (human-specific checks)
│       ├── data/            # starwars.csv, starwars.fwf
│       └── data2/           # starwars_v1.csv (original), starwars_v2.csv (perturbed)
├── tests/testthat/
│   ├── test-utils.R
│   ├── test-checks.R
│   ├── test-compare.R
│   ├── test-ingest.R
│   ├── test-snapshot.R
│   └── test-integration.R
├── vignettes/
│   ├── dqcheckr.Rmd         # user guide
│   └── specification.Rmd    # this document
├── DESCRIPTION
└── NAMESPACE

C.2 Configuration

Global config — dqcheckr.yml

snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Single-snapshot thresholds
  max_missing_rate: 0.05            # FAIL if column missing rate exceeds this
  max_non_numeric_rate: 0.01        # FAIL if non-numeric rate in numeric column exceeds this
  min_row_count: 0                  # FAIL if row count below this (0 = disabled)

  # Version comparison thresholds
  max_row_count_change_pct: 0.10
  max_numeric_mean_shift_pct: 0.20
  max_missing_rate_change_pp: 2.0   # percentage points
  max_non_numeric_rate_change_pp: 1.0

  # Schema flags
  flag_new_columns: true            # set false to suppress new-column warnings in CP-02
  flag_dropped_columns: true        # set false to suppress dropped-column warnings in CP-02
  flag_type_changes: true           # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true    # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — <dataset_name>.yml

dataset_name: "customer_accounts"
folder: "data/incoming/customer_accounts/"
format: csv                 # csv | fwf
encoding: "UTF-8"
delimiter: ","              # csv only
fwf_widths: ~               # fwf only
fwf_col_names: ~            # fwf only
fwf_skip: 0                 # fwf only — header rows to skip (default 0)

# Optional: explicit file paths override folder-scan version detection
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

# Optional: schema contract
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# Optional: key columns that must be unique
key_columns:
  - id

# Optional rule overrides (merged over global defaults)
rule_overrides:
  max_missing_rate: 0.02
  max_row_count_change_pct: 0.05
  max_non_numeric_rate: 0.005
  min_row_count: 1000

# Optional per-column rules
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]
    pattern: "^[A-Z]{2}$"
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0
    max_value: 1000000

# Optional custom checks
custom_checks_file: "custom/customer_accounts_checks.R"

C.3 File version detection

Two modes:

Folder scan (default): the two most recently modified files in folder are used. Newest = current, second-newest = previous.
Explicit naming: if current_file (and optionally previous_file) are set in the dataset config, those paths are used directly. Useful when file naming conventions do not reflect delivery order, or when re-running a check for a specific pair of files.

If only a current file is available (either via folder scan or explicit config), the run proceeds in single-file mode: version comparison and vs-previous schema checks are skipped.

C.4 Ingestion and whitespace trimming

File is read entirely as character columns via readr::read_delim() or readr::read_fwf().
Leading and trailing whitespace is trimmed from every value immediately after reading.
The trimmed data frame is used for all subsequent checks and comparisons. The raw file is never modified.

C.5 Type inference

Applied to each column of the trimmed data frame. Rules applied in order:

Priority	Condition	Inferred type
1	All non-null/non-empty values parse as a known date format	`date`
2	≥ 90% of non-null/non-empty values coerce to numeric	`numeric`
3	Otherwise	`character`
—	All values null or empty	`unknown`

Known date formats tested: %Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y.

The 90% threshold means a column with up to 10% bad (non-numeric) values is still classified as numeric, making those bad values visible to QC-11.

C.6 Quality checks

Each check returns one or more dq_result objects:

dq_result(check_id, check_name, column = NA, status, observed, threshold = NA, message)
# status: "PASS" | "WARN" | "FAIL" | "INFO"

Single-snapshot checks (QC series)

ID	Name	Applies to	FAIL condition	WARN condition
QC-01	Missing rate	Every column	`missing_rate > max_missing_rate`	—
QC-02	Empty column	Every column	Column is 100% empty	—
QC-03	Duplicate rows	Table	—	Any fully-duplicate rows exist
QC-04	Row count	Table	—	— (INFO)
QC-05	Column count	Table	—	— (INFO)
QC-06	Inferred type	Every column	—	— (INFO)
QC-07	Numeric stats	Numeric columns	—	— (INFO: min/max/mean/SD)
QC-08	Distinct value count	Character columns	—	— (INFO)
QC-09	Allowed values	Configured columns	Any value outside `allowed_values`	—
QC-10	Numeric bounds	Configured columns	Any value outside `min_value`/`max_value`	—
QC-11	Non-numeric values	Numeric columns	`non_numeric_rate > max_non_numeric_rate`	Any non-numeric values present
QC-12	Key uniqueness	Configured `key_columns`	Any duplicate value in key column	—
QC-13	Pattern / regex	Configured columns	Any value not matching `pattern`	—
QC-14	Minimum row count	Table	`row_count < min_row_count` (if > 0)	—

Schema contract checks (SC series) — only when expected_columns is configured

ID	Name	FAIL condition
SC-01	Unexpected column	Column in file but not in `expected_columns`
SC-02	Missing expected column	Column in `expected_columns` but absent from file

Version comparison checks (CP series) — skipped in single-file mode

ID	Name	Flag condition	Severity
CP-01	Row count change	`abs(pct_change) > max_row_count_change_pct`	WARN
CP-02	Schema diff	Columns added, dropped, or type-changed (each independently suppressible via `flag_new_columns`, `flag_dropped_columns`, `flag_type_changes`)	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp per column	WARN
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct` per column	WARN
CP-05	New distinct values	Values in current not in previous (categorical columns)	INFO
CP-06	Dropped distinct values	Values in previous not in current (categorical columns)	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp per column	WARN
CP-08	Column order change	Column order differs from previous (suppressible via `flag_column_order_change: false`)	WARN (CSV) / FAIL (FWF)

CP-02 is the single source of truth for schema differences. The new_cols_vs_previous and missing_cols_vs_previous columns in the snapshots table are derived directly from the CP-02 result — they are not computed separately.

C.7 Custom checks

custom_checks_file in dataset config points to an .R file.
Sourced into an isolated environment (not the global env).
Must define custom_checks(df) returning a list of dq_result objects.
Receives the trimmed data frame.
Errors (missing file, missing function, runtime error) stop the run.
Results appear in the “Custom Checks” report section and are written to column_snapshots (see C.8).

C.8 SQLite schema

Database and tables created automatically on first run via init_snapshot_db().

Table: snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`dataset_name`	TEXT
`run_timestamp`	TEXT	`YYYY-MM-DD HH:MM:SS`
`file_name`	TEXT	Basename of current file
`row_count`	INTEGER
`col_count`	INTEGER
`check_pass_count`	INTEGER
`check_warn_count`	INTEGER
`check_fail_count`	INTEGER
`check_info_count`	INTEGER
`overall_status`	TEXT	`FAIL` > `WARN` > `PASS` > `INFO`
`new_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`missing_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`new_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set
`missing_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set

Table: column_snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`snapshot_id`	INTEGER FK	→ `snapshots.id`
`column_name`	TEXT	Column name, or check_id for custom checks
`dq_check`	TEXT	Stat name or custom check_id
`value`	TEXT	Always stored as text
`threshold`	TEXT	NULL if no threshold applies
`severity_on_breach`	TEXT	`FAIL`, `WARN`, or NULL if informational

Rows written per column type:

`dq_check`	Written for	`threshold`	`severity_on_breach`
`inferred_type`	All columns	NULL	NULL
`missing_count`	All columns	NULL	NULL
`missing_rate`	All columns	`max_missing_rate`	`FAIL`
`distinct_count`	All columns	NULL	NULL
`numeric_mean`	Numeric only	NULL	NULL
`numeric_sd`	Numeric only	NULL	NULL
`numeric_min`	Numeric only	NULL	NULL
`numeric_max`	Numeric only	NULL	NULL
`non_numeric_count`	Numeric only	NULL	NULL
`non_numeric_rate`	Numeric only	`max_non_numeric_rate`	`FAIL`

Custom check rows: one row per dq_result with a non-NA column value. dq_check = check_id, value = observed, threshold = threshold, severity_on_breach = status if WARN/FAIL, else NULL.

C.9 Main entry point

run_dq_check(
  dataset_name,        # matches <dataset_name>.yml in config_dir
  config_dir = ".",    # contains dqcheckr.yml and <dataset_name>.yml
  open_report = TRUE   # open in browser when interactive
)
# Returns invisibly: list(status, report_path, snapshot_id)

Report filename: <dataset_name>_<YYYYMMDD_HHMMSS>.html

Console output:

[dqcheckr] customer_accounts: FAIL - 0 warning(s), 2 failure(s). Report: reports/customer_accounts_20260410_143022.html

C.10 Error handling

Situation	Behaviour
Folder does not exist	Stop; no output written
No files in folder	Stop; no output written
Explicit file path not found	Stop; no output written
Only one file available	Single-file mode; comparison and vs-previous schema skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report and snapshot written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written
Report render fails	Stop; snapshot already written
`snapshot_db` path missing	Created automatically by `init_snapshot_db()`

C.11 Dependencies

Package	Role
`readr`	CSV / FWF ingestion
`DBI` + `RSQLite`	Snapshot store
`rmarkdown` + `knitr`	Report rendering
`kableExtra`	Styled HTML tables in report
`ggplot2` + `gridExtra`	Trend charts in report
`yaml`	Config parsing
`dplyr`	Data manipulation
`tidyr`	Data reshaping for trend charts
`rlang`	Structured error conditions

All on CRAN. pointblank and dataCompareR were evaluated and not used — all logic is implemented natively.

Part D — Unresolved and Deferred Issues

D.1 Needs design before implementation

#	Issue	Detail
D-01	Explicit file naming — partial spec	The dataset config allows `current_file` and `previous_file` to override folder-scan detection, but the behaviour when one is set and the other is not has not been fully specified. Implemented as: `current_file` set but `previous_file` absent runs in single-file mode.
D-02	CP-08 column order — WARN vs FAIL severity	CP-08 is WARN for CSV and FAIL for FWF. There is currently no way to override the severity per dataset. A `column_order_change_severity` per-dataset override may be needed.
D-03	QC-03 duplicate rows — no key scope	QC-03 checks for fully-identical rows only. Multi-column key uniqueness (two rows with the same `id` but different `amount`) is not yet implemented.

D.2 Backlog — agreed

#	Feature
B-01	Cross-column consistency rules — conditional rules between columns (e.g. if `status = CLOSED` then `close_date` must not be null)
B-02	Date column recency check — flag if the most recent value in a date column is older than N days
B-03	Multi-column key uniqueness — QC-12 currently checks single columns only
B-04	`post_run_hook` parameter — user-supplied function for email/Slack/S3 delivery after a run
B-05	`batch_run(config_dir)` — run all configured datasets in sequence
B-06	Additional input formats — Parquet (`arrow`), Excel, database tables via `DBI`
B-07	CP-03 severity — make WARN vs FAIL configurable (currently always WARN)
B-08	Report themes — organisation branding via a `theme` parameter

D.3 Known limitations

#	Limitation
L-01	~~Type inference uses a fixed 90% threshold~~ — resolved in v0.1.1; configurable via `type_inference_threshold` in `rule_overrides`
L-02	Date formats tested are hardcoded to five common patterns; exotic formats will be misclassified as character
L-03	The report appendix reads from in-memory `col_stats`, not from SQLite. If the SQLite write fails but the report succeeds, the appendix and the database can show slightly different values in edge cases
L-04	Performance target (< 60 seconds for 1 M rows) has not been benchmarked; very wide files (many columns) may be slower than expected

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.