
Type:

Package

Title:

Automated Data Quality Checks for Recurring Dataset Deliveries

Version:

0.1.2

Date:

2026-05-16

Description:

Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks, compares the file to the previous delivery, writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.

License:

MIT + file LICENSE

URL:

https://github.com/mickmioduszewski/dqcheckr

BugReports:

https://github.com/mickmioduszewski/dqcheckr/issues

Encoding:

UTF-8

Language:

en-GB

Depends:

R (≥ 4.2)

Imports:

readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, dplyr, tidyr, yaml, rlang

Suggests:

testthat (≥ 3.0.0)

VignetteBuilder:

knitr

Config/testthat/edition:

Config/roxygen2/version:

8.0.0

NeedsCompilation:

Packaged:

2026-05-15 16:51:37 UTC; mick

Author:

Mick Mioduszewski [aut, cre]

Maintainer:

Mick Mioduszewski <mick@mioduszewski.net>

Repository:

CRAN

Date/Publication:

2026-05-20 08:00:07 UTC

dqcheckr: Automated Data Quality Checks for Recurring Dataset Deliveries

Description

Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks (QC-01 to QC-14, SC-01/SC-02), compares the file to the previous delivery (CP-01 to CP-08), writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.

Details

The main entry point is run_dq_check. Configuration is driven by two 'YAML' files: a global dqcheckr.yml and a per-dataset <dataset_name>.yml.

Author(s)

Maintainer: Mick Mioduszewski <mick@mioduszewski.net>

Authors:

Mick Mioduszewski <mick@mioduszewski.net>

Compute missing rate for a vector

Description

Compute missing rate for a vector

Usage

.missing_rate_vec(x)

Test for missing or empty values

Description

Test for missing or empty values

Usage

.missing_vals(x)

QC-09: Check for values outside the allowed set

Description

QC-09: Check for values outside the allowed set

Usage

check_allowed_values(df, config)

QC-05: Report column count

Description

QC-05: Report column count

Usage

check_col_count(df, config)

QC-08: Report distinct value counts for character columns

Description

QC-08: Report distinct value counts for character columns

Usage

check_distinct_counts(df, config)

QC-03: Check for fully-duplicate rows

Description

QC-03: Check for fully-duplicate rows

Usage

check_duplicate_rows(df, config)

QC-02: Check for entirely empty columns

Description

QC-02: Check for entirely empty columns

Usage

check_empty_column(df, config)

QC-06: Report inferred column types

Description

QC-06: Report inferred column types

Usage

check_inferred_types(df, config)

QC-12: Check uniqueness of key columns

Description

QC-12: Check uniqueness of key columns

Usage

check_key_uniqueness(df, config)

QC-14: Check minimum row count threshold

Description

QC-14: Check minimum row count threshold

Usage

check_min_row_count(df, config)

QC-01: Check missing rate per column

Description

Returns a dq_result per column flagging columns whose proportion of missing or empty values exceeds max_missing_rate.

Usage

check_missing_rate(df, config)

Arguments

df

A data frame with all columns as character vectors.

config

Named list as returned by load_config.

Value

A list of dq_result objects, one per column.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg  <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df   <- read_dataset(path, cfg)
check_missing_rate(df, cfg)
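
The idea behind QC-01 can be sketched in base R. This is only an illustration, not the package's implementation: it assumes that "missing or empty" means NA or an empty string after trimming, and the helper missing_rate, the sample data and the 0.40 threshold are all hypothetical.

```r
# Hypothetical helper: proportion of values that are NA or empty after trimming.
missing_rate <- function(x) {
  if (length(x) == 0) return(NA_real_)
  mean(is.na(x) | trimws(x) == "")
}

# Invented sample data, all columns held as character.
df <- data.frame(
  name   = c("Luke", "Leia", "", NA),
  height = c("172", "150", "96", "202"),
  stringsAsFactors = FALSE
)

max_missing_rate <- 0.40  # illustrative threshold, not a package default
rates   <- vapply(df, missing_rate, numeric(1))
flagged <- names(rates)[rates > max_missing_rate]
flagged
```

Here half of the name column is missing or empty, so it is the only column flagged.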

QC-11: Check non-numeric rate in numeric columns

Description

QC-11: Check non-numeric rate in numeric columns

Usage

check_non_numeric(df, config)

QC-10: Check for out-of-range numeric values

Description

QC-10: Check for out-of-range numeric values

Usage

check_numeric_bounds(df, config)

QC-07: Report numeric summary statistics

Description

QC-07: Report numeric summary statistics

Usage

check_numeric_stats(df, config)

QC-13: Check values against a regex pattern

Description

QC-13: Check values against a regex pattern

Usage

check_pattern(df, config)

QC-04: Report row count

Description

QC-04: Report row count

Usage

check_row_count(df, config)

SC-01/SC-02: Check columns against expected schema contract

Description

SC-01/SC-02: Check columns against expected schema contract

Usage

check_schema_contract(df, config)

CP-08: Check column order consistency between deliveries

Description

CP-08: Check column order consistency between deliveries

Usage

compare_column_order(df_current, df_previous, config)

CP-06: Detect dropped distinct values in character columns

Description

CP-06: Detect dropped distinct values in character columns

Usage

compare_dropped_values(df_current, df_previous, config)

CP-03: Compare per-column missing rate between deliveries

Description

CP-03: Compare per-column missing rate between deliveries

Usage

compare_missing_rate(df_current, df_previous, config)

CP-05: Detect new distinct values in character columns

Description

CP-05: Detect new distinct values in character columns

Usage

compare_new_values(df_current, df_previous, config)

CP-07: Compare non-numeric rate in numeric columns between deliveries

Description

CP-07: Compare non-numeric rate in numeric columns between deliveries

Usage

compare_non_numeric_rate(df_current, df_previous, config)

CP-04: Compare numeric column means between deliveries

Description

CP-04: Compare numeric column means between deliveries

Usage

compare_numeric_mean(df_current, df_previous, config)

CP-01: Compare row count between deliveries

Description

CP-01: Compare row count between deliveries

Usage

compare_row_count(df_current, df_previous, config)

CP-02: Detect schema differences between deliveries

Description

CP-02: Detect schema differences between deliveries

Usage

compare_schema(df_current, df_previous, config)

Compute per-column statistics for snapshot storage

Description

Compute per-column statistics for snapshot storage

Usage

compute_col_stats(df, config, qc_results)

Detect current and previous dataset files

Description

Resolves the current and previous file paths from the configuration. If current_file is set explicitly, it is used directly. Otherwise the two most recently modified files in folder are used.

Usage

detect_files(config)

Arguments

config

Named list. Merged configuration as returned by load_config.

Value

A named list with elements current (character path) and previous (character path or NULL).

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$current_file <- system.file("demonstrations/data/starwars.csv",
                                 package = "dqcheckr")
files <- detect_files(cfg)
files$current
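
The fallback rule (use the two most recently modified files in the folder) might look roughly like the base-R sketch below. The helper latest_two and the file names are hypothetical; how the package itself filters or orders files is not documented here.

```r
# Hypothetical helper: pick the two most recently modified files in a folder.
latest_two <- function(folder) {
  paths <- list.files(folder, full.names = TRUE)
  paths <- paths[!dir.exists(paths)]                  # keep files only
  paths <- paths[order(file.mtime(paths), decreasing = TRUE)]
  list(
    current  = if (length(paths) >= 1) paths[1] else NULL,
    previous = if (length(paths) >= 2) paths[2] else NULL
  )
}

# Build a throwaway folder with two deliveries.
folder <- tempfile("deliveries")
dir.create(folder)
old <- file.path(folder, "delivery_old.csv")
new <- file.path(folder, "delivery_new.csv")
writeLines("a,b", old)
writeLines("a,b", new)

# Set modification times explicitly so the ordering is deterministic.
Sys.setFileTime(old, Sys.time() - 3600)
Sys.setFileTime(new, Sys.time())

files <- latest_two(folder)
basename(files$current)
basename(files$previous)
```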

Construct a data quality result object

Description

Creates the atomic result unit returned by every check function.

Usage

dq_result(
  check_id,
  check_name,
  column = NA_character_,
  status,
  observed,
  threshold = NA_character_,
  message
)

Arguments

check_id

Character. Short identifier for the check (e.g. "QC-01").

check_name

Character. Human-readable name of the check.

column

Character. Column the check applies to, or NA_character_ for row-level or file-level checks.

status

Character. One of "PASS", "WARN", "FAIL", or "INFO".

observed

Character. What was observed (e.g. "5.2% missing").

threshold

Character. The configured threshold, or NA_character_ if not applicable.

message

Character. Human-readable description of the result.

Value

A named list with seven elements: check_id, check_name, column, status, observed, threshold, message.

Examples

dq_result("QC-01", "Missing rate", column = "age",
          status = "PASS", observed = "0% missing",
          message = "No missing values.")
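
Because each result is a named list with the same seven elements, a list of results can be flattened into a data frame for filtering or printing. The sketch below builds stand-in results by hand and assumes they are plain lists with scalar fields; the values are invented.

```r
# Stand-in results with the seven documented fields (plain lists, built by hand).
results <- list(
  list(check_id = "QC-01", check_name = "Missing rate", column = "age",
       status = "PASS", observed = "0% missing", threshold = "60%",
       message = "No missing values."),
  list(check_id = "QC-04", check_name = "Row count", column = NA_character_,
       status = "INFO", observed = "87 rows", threshold = NA_character_,
       message = "Row count reported.")
)

# Bind the results row-wise into one data frame.
results_df <- do.call(rbind, lapply(results, function(r) {
  as.data.frame(r, stringsAsFactors = FALSE)
}))

# Show only the results that are not a plain PASS.
results_df[results_df$status != "PASS", c("check_id", "status", "message")]
```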

Infer the logical type of a character column

Description

Classifies a character vector as "date", "numeric", "character", or "unknown" by applying rules in priority order.

Usage

infer_col_type(x, threshold = 0.9)

Arguments

x

Character vector to classify (as read from a CSV or FWF file).

threshold

Numeric. Minimum proportion of non-empty values that must parse as numeric for the column to be classified as "numeric". Defaults to 0.90. Configurable via type_inference_threshold in rule_overrides.

Value

A single character string: "date", "numeric", "character", or "unknown".

Examples

infer_col_type(c("2024-01-01", "2024-06-15"))   # "date"
infer_col_type(c("1.5", "2.0", "3.1"))          # "numeric"
infer_col_type(c("high", "low", "medium"))       # "character"
infer_col_type(c(NA, "", NA))                    # "unknown"
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric"
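
One possible reading of "rules in priority order" is sketched below. It is a hypothetical re-implementation, not the package's code: the rule order (unknown, then date, then numeric, then character) and the ISO YYYY-MM-DD date format are assumptions, and sketch_col_type is an invented name.

```r
# Hypothetical re-implementation; rule order and date format are assumptions.
sketch_col_type <- function(x, threshold = 0.9) {
  vals <- trimws(x[!is.na(x)])
  vals <- vals[vals != ""]
  if (length(vals) == 0) return("unknown")   # nothing left to classify

  # Assumed date rule: every non-empty value is a valid YYYY-MM-DD date.
  looks_iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", vals)
  if (all(looks_iso) && !anyNA(as.Date(vals, format = "%Y-%m-%d"))) {
    return("date")
  }

  # Numeric rule: the share of values that parse as numbers meets the threshold.
  parsed <- suppressWarnings(as.numeric(vals))
  if (mean(!is.na(parsed)) >= threshold) return("numeric")

  "character"
}

mixed <- c(rep("1", 17), "a", "b", "c")   # 17 of 20 values (85%) are numeric
sketch_col_type(mixed, threshold = 0.80)
sketch_col_type(mixed)                    # default threshold of 0.9
```

With 85% of the values parsing as numbers, the column counts as numeric at a threshold of 0.80 but falls back to character at the default of 0.9.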

Initialise the SQLite snapshot database

Description

Initialise the SQLite snapshot database

Usage

init_snapshot_db(db_path)

Load and merge dataset configuration

Description

Reads the global dqcheckr.yml and the dataset-specific YAML, merging rule_overrides from the dataset config on top of default_rules from the global config.

Usage

load_config(dataset_name, config_dir)

Arguments

dataset_name

Character. Dataset name; must match <dataset_name>.yml in config_dir.

config_dir

Character. Path to the directory containing both YAML files.

Value

A named list representing the merged configuration.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$format
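
The merge of rule_overrides on top of default_rules can be pictured with utils::modifyList. This is a sketch under assumptions: the two lists stand in for the parsed YAML files, and the exact shape of the merged configuration returned by load_config is not specified here.

```r
# Stand-in for the parsed global dqcheckr.yml.
global_cfg <- list(
  snapshot_db   = "snap.sqlite",
  default_rules = list(max_missing_rate = 0.60, min_row_count = 80)
)

# Stand-in for the parsed <dataset_name>.yml.
dataset_cfg <- list(
  dataset_name   = "starwars_csv",
  format         = "csv",
  rule_overrides = list(max_missing_rate = 0.20)
)

# Dataset overrides win; rules that are not overridden keep their global defaults.
rules <- utils::modifyList(global_cfg$default_rules, dataset_cfg$rule_overrides)
str(rules)
```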

Compute the worst status across a list of dq_result objects

Description

Returns the single worst status in precedence order: "FAIL" > "WARN" > "PASS" > "INFO".

Usage

overall_status(results)

Arguments

results

A list of dq_result objects.

Value

A single character string: "FAIL", "WARN", "PASS", or "INFO".

Examples

r1 <- dq_result("QC-01", "test", status = "PASS", observed = "ok", message = "ok")
r2 <- dq_result("QC-02", "test", status = "WARN", observed = "ok", message = "ok")
overall_status(list(r1, r2))  # "WARN"
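
The documented precedence can be expressed as a lookup over an ordered vector. The helper worst_status below is hypothetical and works on bare status strings rather than dq_result objects; returning NA for empty input is an assumption.

```r
# Hypothetical helper mirroring the precedence FAIL > WARN > PASS > INFO.
worst_status <- function(statuses) {
  precedence <- c("FAIL", "WARN", "PASS", "INFO")
  hit <- precedence[precedence %in% statuses]
  if (length(hit) == 0) NA_character_ else hit[1]
}

worst_status(c("PASS", "INFO", "WARN"))
worst_status(c("INFO", "INFO"))
```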

Read a dataset file into a data frame

Description

Reads a CSV or fixed-width file, coercing all columns to character and trimming whitespace. Encoding and delimiter are taken from config.

Usage

read_dataset(path, config)

Arguments

path

Character. Path to the file to read.

config

Named list. Merged configuration as returned by load_config. Must include format ("csv" or "fwf"). For FWF files, fwf_widths is required and fwf_col_names and fwf_skip are optional.

Value

A data frame with all columns as character vectors.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg  <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df   <- read_dataset(path, cfg)
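
Reading every column as character and trimming whitespace can be approximated in base R as follows. This sketch uses utils::read.csv for illustration only and makes no claim about how read_dataset reads files; the sample file is invented.

```r
# Write a small CSV with stray whitespace and a zero-padded value.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height", " Luke ,172", "Leia, ", "Yoda,066"), csv_path)

raw <- utils::read.csv(
  csv_path,
  colClasses       = "character",  # no type guessing: "066" stays "066"
  strip.white      = TRUE,         # trim unquoted fields
  stringsAsFactors = FALSE,
  fileEncoding     = "UTF-8"
)
raw[] <- lapply(raw, trimws)       # also trims fields that were quoted

raw
```

Keeping everything as character preserves values such as leading zeros, which a type-guessing reader would silently alter.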

Read recent snapshot history from the SQLite database

Description

Retrieves the n most recent run records for a given dataset from the snapshot database, ordered newest-first.

Usage

read_recent_snapshots(db_path, dataset_name, n = 10)

Arguments

db_path

Character. Path to the SQLite database file.

dataset_name

Character. Dataset name to filter on.

n

Integer. Maximum number of records to return. Defaults to 10.

Value

A data frame with one row per run and columns including id, run_timestamp, file_name, row_count, overall_status, check_pass_count, check_warn_count, check_fail_count. Returns an empty data frame if the database does not exist or contains no records for the dataset.

Examples

# The temporary path does not exist yet, so an empty data frame is returned.
history <- read_recent_snapshots(tempfile(fileext = ".sqlite"), "starwars_csv")
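
Since records come back newest-first, a trend calculation usually starts by reversing them. The sketch below uses a hand-made stand-in with a few of the documented columns; the values and the timestamp format are invented.

```r
# Stand-in for a snapshot history, ordered newest-first.
history <- data.frame(
  run_timestamp  = c("2026-05-03 08:00:00", "2026-05-02 08:00:00",
                     "2026-05-01 08:00:00"),
  row_count      = c(90L, 87L, 87L),
  overall_status = c("WARN", "PASS", "PASS"),
  stringsAsFactors = FALSE
)

# Reverse to chronological order, then compute the change in row count per run.
chrono <- history[rev(seq_len(nrow(history))), ]
chrono$row_change <- c(NA, diff(chrono$row_count))
chrono[, c("run_timestamp", "row_count", "row_change")]
```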

Render the HTML data quality report

Description

Render the HTML data quality report

Usage

render_report(
  dataset_name,
  file_name,
  file_path,
  df,
  qc_results,
  cp_results,
  custom_results,
  snapshot_history,
  config,
  col_stats = NULL,
  output_dir,
  open_report = TRUE
)

Run all version comparison checks between two dataset snapshots

Description

Runs CP-01 to CP-08 comparing a current delivery against the previous one.

Usage

run_comparison_checks(df_current, df_previous, config)

Arguments

df_current

A data frame. The current delivery.

df_previous

A data frame. The previous delivery.

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects. The list carries attributes new_cols and dropped_cols (character vectors) for use by the snapshot writer.

Examples

cfg_dir   <- system.file("demonstrations/config", package = "dqcheckr")
cfg       <- load_config("starwars_csv", config_dir = cfg_dir)
curr_path <- system.file("demonstrations/data2/starwars_v2.csv", package = "dqcheckr")
prev_path <- system.file("demonstrations/data2/starwars_v1.csv", package = "dqcheckr")
curr      <- read_dataset(curr_path, cfg)
prev      <- read_dataset(prev_path, cfg)
results   <- run_comparison_checks(curr, prev, cfg)
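
The new_cols and dropped_cols attributes can be thought of as two set differences over the column names. The sketch below attaches them to an empty placeholder list; the data frames are invented and the way the package computes the attributes is an assumption.

```r
df_previous <- data.frame(name = "Luke", height = "172", mass = "77",
                          stringsAsFactors = FALSE)
df_current  <- data.frame(name = "Luke", height = "172", species = "Human",
                          stringsAsFactors = FALSE)

results <- list()  # placeholder for the list of check results

# Columns present only in the current delivery, and only in the previous one.
attr(results, "new_cols")     <- setdiff(names(df_current), names(df_previous))
attr(results, "dropped_cols") <- setdiff(names(df_previous), names(df_current))

attr(results, "new_cols")
attr(results, "dropped_cols")
```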

Run organisation-specific custom checks

Description

Sources the R file specified by config$custom_checks_file, which must define a function custom_checks(df) returning a list of dq_result objects. Returns an empty list if custom_checks_file is not set in the config.

Usage

run_custom_checks(df, config)

Arguments

df

A data frame. The current delivery.

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects (may be empty).

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg     <- load_config("starwars_csv", config_dir = cfg_dir)
path    <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df      <- read_dataset(path, cfg)
results <- run_custom_checks(df, cfg)
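
A custom checks file might look like the sketch below. The check ID "CUSTOM-01" and the rule itself are hypothetical. So that the block runs on its own, it defines a stand-in with the documented dq_result signature; in a real custom checks file that stand-in would be omitted and the package's dq_result used instead.

```r
# Stand-in with the documented dq_result() signature so this sketch runs alone.
dq_result <- function(check_id, check_name, column = NA_character_, status,
                      observed, threshold = NA_character_, message) {
  list(check_id = check_id, check_name = check_name, column = column,
       status = status, observed = observed, threshold = threshold,
       message = message)
}

# Contents of a hypothetical custom checks file: it must define custom_checks(df).
custom_checks <- function(df) {
  # Hypothetical rule: 'height' must not contain negative numbers.
  h   <- suppressWarnings(as.numeric(df$height))
  bad <- sum(!is.na(h) & h < 0)
  list(
    dq_result(
      check_id   = "CUSTOM-01",
      check_name = "Non-negative height",
      column     = "height",
      status     = if (bad == 0) "PASS" else "FAIL",
      observed   = paste(bad, "negative values"),
      threshold  = "0",
      message    = if (bad == 0) "All heights are non-negative."
                   else "Negative heights found."
    )
  )
}

# Columns are character, as they would be after reading a delivery.
df  <- data.frame(height = c("172", "-5", ""), stringsAsFactors = FALSE)
out <- custom_checks(df)
out[[1]]$status
```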

Run a full data quality check pipeline

Description

Orchestrates the complete dqcheckr pipeline: loads configuration, detects files, runs QC and comparison checks, writes a snapshot to SQLite, and renders an HTML report.

Usage

run_dq_check(dataset_name, config_dir = ".", open_report = TRUE)

Arguments

dataset_name

Character. Name of the dataset; must match a YAML config file <dataset_name>.yml in config_dir.

config_dir

Character. Path to the directory containing dqcheckr.yml and the dataset YAML file. Defaults to ".".

open_report

Logical. Whether to open the HTML report in the browser after rendering (only takes effect in interactive sessions).

Value

Invisibly, a named list with:

status: Overall status string: "PASS", "WARN", "FAIL", or "INFO".
report_path: Absolute path to the rendered HTML report.
snapshot_id: Integer row ID of the snapshot written to SQLite, or NULL if the write failed.

Examples


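# Work in a temporary directory; forward slashes avoid backslash escapes
# inside the double-quoted YAML strings on Windows.
tmp <- gsub("\\\\", "/", tempdir())
dat <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
# Global configuration: snapshot database, report folder and default rules.
writeLines(c(
  paste0('snapshot_db: "',       tmp, '/snap.sqlite"'),
  paste0('report_output_dir: "', tmp, '"'),
  'default_rules:',
  '  max_missing_rate: 0.60',
  '  min_row_count: 80'
), file.path(tmp, "dqcheckr.yml"))
# Dataset configuration: current_file points at the bundled demonstration CSV.
writeLines(c(
  'dataset_name: "starwars_csv"',
  paste0('current_file: "', dat, '"'),
  'format: csv',
  'encoding: "UTF-8"',
  'delimiter: ","'
), file.path(tmp, "starwars_csv.yml"))
# Run the full pipeline without opening the report in a browser.
result <- run_dq_check("starwars_csv", config_dir = tmp, open_report = FALSE)
result$status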
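
In a scheduled job, the returned status could drive the process exit code. The sketch below is one possible convention, not something the package prescribes: exit_code_for is a hypothetical helper, and the result list is a hand-made stand-in with invented values.

```r
# Hypothetical mapping from the documented status strings to an exit code.
exit_code_for <- function(status) {
  switch(status,
         FAIL = 1L, WARN = 0L, PASS = 0L, INFO = 0L,
         stop("Unexpected status: ", status))
}

# Stand-in for the list returned by run_dq_check(); values are invented.
result <- list(status = "FAIL", report_path = "/tmp/report.html",
               snapshot_id = 1L)

code <- exit_code_for(result$status)
if (!interactive() && code != 0L) {
  message("Data quality check failed; see ", result$report_path)
  # quit(status = code)  # uncomment in a scheduled script
}
```

Whether a WARN should also produce a non-zero exit code is a local policy decision.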

Run all generic quality checks on a dataset

Description

Runs the full QC check suite (QC-01 to QC-14, SC-01, SC-02) against a single data frame snapshot.

Usage

run_qc_checks(df, config)

Arguments

df

A data frame with all columns as character vectors (as returned by read_dataset).

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg     <- load_config("starwars_csv", config_dir = cfg_dir)
path    <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df      <- read_dataset(path, cfg)
results <- run_qc_checks(df, cfg)
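
A list of results is often easiest to digest as counts per status. The sketch below tallies statuses from stand-in results that carry only the fields needed here; the check outcomes are invented.

```r
# Stand-in results: only the 'status' field matters for this summary.
results <- list(
  list(check_id = "QC-01", status = "PASS"),
  list(check_id = "QC-03", status = "WARN"),
  list(check_id = "QC-04", status = "INFO"),
  list(check_id = "QC-14", status = "PASS")
)

statuses <- vapply(results, function(r) r$status, character(1))

# Fixing the factor levels keeps statuses with zero results in the table.
counts <- table(factor(statuses, levels = c("PASS", "WARN", "FAIL", "INFO")))
counts
```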

Write a run snapshot to the SQLite database

Description

Write a run snapshot to the SQLite database

Usage

write_snapshot(
  db_path,
  dataset_name,
  file_name,
  df,
  qc_results,
  cp_results,
  custom_results,
  config,
  col_stats = NULL
)
