
Package {dqcheckr}


Type: Package
Title: Automated Data Quality Checks for Recurring Dataset Deliveries
Version: 0.1.2
Date: 2026-05-16
Description: Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks, compares the file to the previous delivery, writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.
License: MIT + file LICENSE
URL: https://github.com/mickmioduszewski/dqcheckr
BugReports: https://github.com/mickmioduszewski/dqcheckr/issues
Encoding: UTF-8
Language: en-GB
Depends: R (≥ 4.2)
Imports: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, dplyr, tidyr, yaml, rlang
Suggests: testthat (≥ 3.0.0)
VignetteBuilder: knitr
Config/testthat/edition: 3
Config/roxygen2/version: 8.0.0
NeedsCompilation: no
Packaged: 2026-05-15 16:51:37 UTC; mick
Author: Mick Mioduszewski [aut, cre]
Maintainer: Mick Mioduszewski <mick@mioduszewski.net>
Repository: CRAN
Date/Publication: 2026-05-20 08:00:07 UTC

dqcheckr: Automated Data Quality Checks for Recurring Dataset Deliveries

Description

Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks (QC-01 to QC-14, SC-01/SC-02), compares the file to the previous delivery (CP-01 to CP-08), writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.

Details

The main entry point is run_dq_check. Configuration is driven by two 'YAML' files: a global dqcheckr.yml and a per-dataset <dataset_name>.yml.

Author(s)

Maintainer: Mick Mioduszewski mick@mioduszewski.net

See Also

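Useful links:

https://github.com/mickmioduszewski/dqcheckr

Report bugs at https://github.com/mickmioduszewski/dqcheckr/issues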


Compute missing rate for a vector

Description

Compute missing rate for a vector

Usage

.missing_rate_vec(x)

Test for missing or empty values

Description

Test for missing or empty values

Usage

.missing_vals(x)

QC-09: Check for values outside the allowed set

Description

QC-09: Check for values outside the allowed set

Usage

check_allowed_values(df, config)

QC-05: Report column count

Description

QC-05: Report column count

Usage

check_col_count(df, config)

QC-08: Report distinct value counts for character columns

Description

QC-08: Report distinct value counts for character columns

Usage

check_distinct_counts(df, config)

QC-03: Check for fully-duplicate rows

Description

QC-03: Check for fully-duplicate rows

Usage

check_duplicate_rows(df, config)

QC-02: Check for entirely empty columns

Description

QC-02: Check for entirely empty columns

Usage

check_empty_column(df, config)

QC-06: Report inferred column types

Description

QC-06: Report inferred column types

Usage

check_inferred_types(df, config)

QC-12: Check uniqueness of key columns

Description

QC-12: Check uniqueness of key columns

Usage

check_key_uniqueness(df, config)

QC-14: Check minimum row count threshold

Description

QC-14: Check minimum row count threshold

Usage

check_min_row_count(df, config)

QC-01: Check missing rate per column

Description

Returns a dq_result per column flagging columns whose proportion of missing or empty values exceeds max_missing_rate.

Usage

check_missing_rate(df, config)

Arguments

df

A data frame with all columns as character vectors.

config

Named list as returned by load_config.

Value

A list of dq_result objects, one per column.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg  <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df   <- read_dataset(path, cfg)
check_missing_rate(df, cfg)
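
The idea behind QC-01 can be illustrated in base R. This is a minimal sketch, not the package's implementation: the helper name missing_rate and the threshold value are assumptions made for illustration, and the definition of "missing or empty" used here (NA or blank after trimming) is an assumption as well.

```r
# Hypothetical helper: share of values that are NA or empty after trimming.
missing_rate <- function(x) {
  if (length(x) == 0) return(NA_real_)
  mean(is.na(x) | trimws(x) == "")
}

# Toy data with all columns as character vectors.
df <- data.frame(
  name   = c("Luke", "Leia", "", NA),
  height = c("172", "150", "96", "202"),
  stringsAsFactors = FALSE
)

# Assumed threshold, playing the role of max_missing_rate in the config.
max_missing_rate <- 0.25

rates   <- vapply(df, missing_rate, numeric(1))
flagged <- names(rates)[rates > max_missing_rate]
flagged
```

In this toy data only the name column exceeds the assumed threshold, so it is the only column flagged.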


QC-11: Check non-numeric rate in numeric columns

Description

QC-11: Check non-numeric rate in numeric columns

Usage

check_non_numeric(df, config)

QC-10: Check for out-of-range numeric values

Description

QC-10: Check for out-of-range numeric values

Usage

check_numeric_bounds(df, config)

QC-07: Report numeric summary statistics

Description

QC-07: Report numeric summary statistics

Usage

check_numeric_stats(df, config)

QC-13: Check values against a regex pattern

Description

QC-13: Check values against a regex pattern

Usage

check_pattern(df, config)

QC-04: Report row count

Description

QC-04: Report row count

Usage

check_row_count(df, config)

SC-01/SC-02: Check columns against expected schema contract

Description

SC-01/SC-02: Check columns against expected schema contract

Usage

check_schema_contract(df, config)

CP-08: Check column order consistency between deliveries

Description

CP-08: Check column order consistency between deliveries

Usage

compare_column_order(df_current, df_previous, config)

CP-06: Detect dropped distinct values in character columns

Description

CP-06: Detect dropped distinct values in character columns

Usage

compare_dropped_values(df_current, df_previous, config)

CP-03: Compare per-column missing rate between deliveries

Description

CP-03: Compare per-column missing rate between deliveries

Usage

compare_missing_rate(df_current, df_previous, config)

CP-05: Detect new distinct values in character columns

Description

CP-05: Detect new distinct values in character columns

Usage

compare_new_values(df_current, df_previous, config)

CP-07: Compare non-numeric rate in numeric columns between deliveries

Description

CP-07: Compare non-numeric rate in numeric columns between deliveries

Usage

compare_non_numeric_rate(df_current, df_previous, config)

CP-04: Compare numeric column means between deliveries

Description

CP-04: Compare numeric column means between deliveries

Usage

compare_numeric_mean(df_current, df_previous, config)

CP-01: Compare row count between deliveries

Description

CP-01: Compare row count between deliveries

Usage

compare_row_count(df_current, df_previous, config)

CP-02: Detect schema differences between deliveries

Description

CP-02: Detect schema differences between deliveries

Usage

compare_schema(df_current, df_previous, config)

Compute per-column statistics for snapshot storage

Description

Compute per-column statistics for snapshot storage

Usage

compute_col_stats(df, config, qc_results)

Detect current and previous dataset files

Description

Resolves the current and previous file paths from the configuration. If current_file is set explicitly, it is used directly. Otherwise the two most recently modified files in folder are used.

Usage

detect_files(config)

Arguments

config

Named list. Merged configuration as returned by load_config.

Value

A named list with elements current (character path) and previous (character path or NULL).

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$current_file <- system.file("demonstrations/data/starwars.csv",
                                 package = "dqcheckr")
files <- detect_files(cfg)
files$current
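
When current_file is not set, the description above says the two most recently modified files are used. A rough base R sketch of that selection follows. The function two_most_recent is hypothetical, and details such as tie-breaking or the handling of subdirectories are assumptions, not documented behaviour of detect_files.

```r
# Hypothetical helper: pick the two newest files in a folder by modification time.
two_most_recent <- function(folder) {
  paths <- list.files(folder, full.names = TRUE)
  paths <- paths[!dir.exists(paths)]            # keep regular files only
  info  <- file.info(paths)
  paths <- paths[order(info$mtime, decreasing = TRUE)]
  list(
    current  = if (length(paths) >= 1) paths[1] else NULL,
    previous = if (length(paths) >= 2) paths[2] else NULL
  )
}

# Build a throwaway folder with two deliveries of different ages.
folder <- tempfile("deliveries")
dir.create(folder)
old <- file.path(folder, "delivery_old.csv")
new <- file.path(folder, "delivery_new.csv")
writeLines("a,b", old)
writeLines("a,b", new)
Sys.setFileTime(old, Sys.time() - 3600)  # make the first file an hour older

files <- two_most_recent(folder)
basename(files$current)
basename(files$previous)
```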


Construct a data quality result object

Description

Creates the atomic result unit returned by every check function.

Usage

dq_result(
  check_id,
  check_name,
  column = NA_character_,
  status,
  observed,
  threshold = NA_character_,
  message
)

Arguments

check_id

Character. Short identifier for the check (e.g. "QC-01").

check_name

Character. Human-readable name of the check.

column

Character. Column the check applies to, or NA_character_ for row-level or file-level checks.

status

Character. One of "PASS", "WARN", "FAIL", or "INFO".

observed

Character. What was observed (e.g. "5.2% missing").

threshold

Character. The configured threshold, or NA_character_ if not applicable.

message

Character. Human-readable description of the result.

Value

A named list with seven elements: check_id, check_name, column, status, observed, threshold, message.

Examples

dq_result("QC-01", "Missing rate", column = "age",
          status = "PASS", observed = "0% missing",
          message = "No missing values.")


Infer the logical type of a character column

Description

Classifies a character vector as "date", "numeric", "character", or "unknown" by applying rules in priority order.

Usage

infer_col_type(x, threshold = 0.9)

Arguments

x

Character vector to classify (as read from a CSV or FWF file).

threshold

Numeric. Minimum proportion of non-empty values that must parse as numeric for the column to be classified as "numeric". Defaults to 0.90. Configurable via type_inference_threshold in rule_overrides.

Value

A single character string: "date", "numeric", "character", or "unknown".

Examples

infer_col_type(c("2024-01-01", "2024-06-15"))   # "date"
infer_col_type(c("1.5", "2.0", "3.1"))          # "numeric"
infer_col_type(c("high", "low", "medium"))       # "character"
infer_col_type(c(NA, "", NA))                    # "unknown"
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric"
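
The exact rules and their order are not spelled out above, so the following is only a sketch of how a priority-ordered classifier of this kind could look. The function sketch_col_type, the ISO-style date pattern, and the order "unknown, date, numeric, character" are all assumptions. Only the role of threshold follows the argument description.

```r
# Hypothetical classifier; rule order and date pattern are assumptions.
sketch_col_type <- function(x, threshold = 0.9) {
  vals <- trimws(x[!is.na(x)])
  vals <- vals[vals != ""]
  # No usable values at all.
  if (length(vals) == 0) return("unknown")
  # Assumed date rule: every non-empty value looks like YYYY-MM-DD.
  if (all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", vals))) return("date")
  # Numeric if a large enough share of non-empty values parses as a number.
  parsed <- suppressWarnings(as.numeric(vals))
  if (mean(!is.na(parsed)) >= threshold) return("numeric")
  "character"
}

mixed <- c(rep("1", 17), "a", "b", "c")   # 85% of values parse as numeric
sketch_col_type(mixed, threshold = 0.80)
sketch_col_type(mixed)                     # default threshold of 0.9
```

With 17 of 20 values numeric, the column passes a threshold of 0.80 but not the default of 0.90, which is why lowering the threshold changes the classification.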


Initialise the SQLite snapshot database

Description

Initialise the SQLite snapshot database

Usage

init_snapshot_db(db_path)

Load and merge dataset configuration

Description

Reads the global dqcheckr.yml and the dataset-specific YAML, merging rule_overrides from the dataset config on top of default_rules from the global config.

Usage

load_config(dataset_name, config_dir)

Arguments

dataset_name

Character. Dataset name; must match <dataset_name>.yml in config_dir.

config_dir

Character. Path to the directory containing both YAML files.

Value

A named list representing the merged configuration.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$format
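
The layering of rule_overrides on top of default_rules can be pictured with utils::modifyList. This is a sketch of the merge idea only: the two lists below stand in for parsed YAML, and how load_config actually names or stores the merged rules is not documented here.

```r
# Stand-ins for the parsed global and dataset YAML files.
global_cfg <- list(
  default_rules = list(max_missing_rate = 0.60, min_row_count = 80)
)
dataset_cfg <- list(
  format         = "csv",
  rule_overrides = list(max_missing_rate = 0.10)
)

# Dataset-level overrides win; untouched defaults are kept.
rules <- utils::modifyList(global_cfg$default_rules, dataset_cfg$rule_overrides)
rules$max_missing_rate
rules$min_row_count
```

Here the dataset overrides max_missing_rate while min_row_count keeps its global default.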


Compute the worst status across a list of dq_result objects

Description

Returns the single worst status in precedence order: "FAIL" > "WARN" > "PASS" > "INFO".

Usage

overall_status(results)

Arguments

results

A list of dq_result objects.

Value

A single character string: "FAIL", "WARN", "PASS", or "INFO".

Examples

r1 <- dq_result("QC-01", "test", status = "PASS", observed = "ok", message = "ok")
r2 <- dq_result("QC-02", "test", status = "WARN", observed = "ok", message = "ok")
overall_status(list(r1, r2))  # "WARN"
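
The precedence rule can be sketched without the package. The helper worst_status is hypothetical, and the plain lists below only mimic the status element of a dq_result.

```r
# Hypothetical helper: first status found when scanning in precedence order.
worst_status <- function(statuses) {
  precedence <- c("FAIL", "WARN", "PASS", "INFO")
  hit <- precedence[precedence %in% statuses]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Minimal stand-ins carrying only the status element.
results  <- list(list(status = "PASS"), list(status = "INFO"), list(status = "WARN"))
statuses <- vapply(results, function(r) r$status, character(1))
worst    <- worst_status(statuses)
worst
```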


Read a dataset file into a data frame

Description

Reads a CSV or fixed-width file, coercing all columns to character and trimming whitespace. Encoding and delimiter are taken from config.

Usage

read_dataset(path, config)

Arguments

path

Character. Path to the file to read.

config

Named list. Merged configuration as returned by load_config. Must include format ("csv" or "fwf"). For FWF files, fwf_widths is required and fwf_col_names and fwf_skip are optional.

Value

A data frame with all columns as character vectors.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg  <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df   <- read_dataset(path, cfg)
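
The "everything as character, whitespace trimmed" convention can be reproduced with base R for a quick comparison. This sketch uses read.csv as a stand-in and is not how read_dataset is implemented (the package imports readr). The file contents are invented for illustration.

```r
# Write a small CSV with stray whitespace around some fields.
path <- tempfile(fileext = ".csv")
writeLines(c("name,height", " Luke ,172", "Leia, 150"), path)

# Read every column as character and strip surrounding whitespace.
df <- read.csv(path, colClasses = "character", strip.white = TRUE,
               fileEncoding = "UTF-8")

vapply(df, is.character, logical(1))
df$height
```

Keeping numeric-looking columns as character preserves the raw values, so checks such as the non-numeric rate can inspect exactly what was delivered.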


Read recent snapshot history from the SQLite database

Description

Retrieves the n most recent run records for a given dataset from the snapshot database, ordered newest-first.

Usage

read_recent_snapshots(db_path, dataset_name, n = 10)

Arguments

db_path

Character. Path to the SQLite database file.

dataset_name

Character. Dataset name to filter on.

n

Integer. Maximum number of records to return. Defaults to 10.

Value

A data frame with one row per run and columns including id, run_timestamp, file_name, row_count, overall_status, check_pass_count, check_warn_count, check_fail_count. Returns an empty data frame if the database does not exist or contains no records for the dataset.

Examples

history <- read_recent_snapshots(tempfile(fileext = ".sqlite"), "starwars_csv")


Render the HTML data quality report

Description

Render the HTML data quality report

Usage

render_report(
  dataset_name,
  file_name,
  file_path,
  df,
  qc_results,
  cp_results,
  custom_results,
  snapshot_history,
  config,
  col_stats = NULL,
  output_dir,
  open_report = TRUE
)

Run all version comparison checks between two dataset snapshots

Description

Runs CP-01 to CP-08 comparing a current delivery against the previous one.

Usage

run_comparison_checks(df_current, df_previous, config)

Arguments

df_current

A data frame. The current delivery.

df_previous

A data frame. The previous delivery.

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects. The list carries attributes new_cols and dropped_cols (character vectors) for use by the snapshot writer.

Examples

cfg_dir   <- system.file("demonstrations/config", package = "dqcheckr")
cfg       <- load_config("starwars_csv", config_dir = cfg_dir)
curr_path <- system.file("demonstrations/data2/starwars_v2.csv", package = "dqcheckr")
prev_path <- system.file("demonstrations/data2/starwars_v1.csv", package = "dqcheckr")
curr      <- read_dataset(curr_path, cfg)
prev      <- read_dataset(prev_path, cfg)
results   <- run_comparison_checks(curr, prev, cfg)
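
The new_cols and dropped_cols attributes mentioned under Value can be pictured as set differences of the column names. The sketch below builds them by hand on an empty list. The toy data frames are invented, and the package may derive the attributes differently.

```r
# Toy deliveries: "mass" disappears and "species" appears.
prev <- data.frame(name = "Luke", height = "172", mass = "77",
                   stringsAsFactors = FALSE)
curr <- data.frame(name = "Luke", height = "172", species = "Human",
                   stringsAsFactors = FALSE)

results <- list()  # stands in for the list of comparison results
attr(results, "new_cols")     <- setdiff(names(curr), names(prev))
attr(results, "dropped_cols") <- setdiff(names(prev), names(curr))

attr(results, "new_cols")
attr(results, "dropped_cols")
```

On a real return value the same attr() calls would read the attributes back.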


Run organisation-specific custom checks

Description

Sources the R file specified by config$custom_checks_file, which must define a function custom_checks(df) returning a list of dq_result objects. Returns an empty list if custom_checks_file is not set in the config.

Usage

run_custom_checks(df, config)

Arguments

df

A data frame. The current delivery.

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects (may be empty).

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg     <- load_config("starwars_csv", config_dir = cfg_dir)
path    <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df      <- read_dataset(path, cfg)
results <- run_custom_checks(df, cfg)
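
A custom checks file might look like the sketch below. The check identifier "ORG-01", the rule itself, and the toy data are hypothetical. Only the custom_checks(df) contract and the dq_result arguments come from this manual. The function is called directly here, whereas in a pipeline run its file path would be given as custom_checks_file in the configuration.

```r
# Contents of a hypothetical custom checks file, e.g. my_checks.R.
custom_checks <- function(df) {
  # Count names that are NA or empty.
  n_blank <- sum(is.na(df$name) | df$name == "")
  list(
    dqcheckr::dq_result(
      check_id   = "ORG-01",              # hypothetical identifier
      check_name = "Name is populated",
      column     = "name",
      status     = if (n_blank == 0) "PASS" else "WARN",
      observed   = paste(n_blank, "blank name(s)"),
      threshold  = "0 blank names",
      message    = "Every record is expected to carry a name."
    )
  )
}

# Direct call on toy data, outside the pipeline.
df  <- data.frame(name = c("Luke", "", "Leia"), stringsAsFactors = FALSE)
res <- custom_checks(df)
res[[1]]$status
```

Prefixing the constructor with dqcheckr:: keeps the file independent of whether the package is attached when the file is sourced.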


Run a full data quality check pipeline

Description

Orchestrates the complete dqcheckr pipeline: loads configuration, detects files, runs QC and comparison checks, writes a snapshot to SQLite, and renders an HTML report.

Usage

run_dq_check(dataset_name, config_dir = ".", open_report = TRUE)

Arguments

dataset_name

Character. Name of the dataset; must match a YAML config file <dataset_name>.yml in config_dir.

config_dir

Character. Path to the directory containing dqcheckr.yml and the dataset YAML file. Defaults to ".".

open_report

Logical. Whether to open the HTML report in the browser after rendering (only takes effect in interactive sessions).

Value

Invisibly, a named list with:

status

Overall status string: "PASS", "WARN", "FAIL", or "INFO".

report_path

Absolute path to the rendered HTML report.

snapshot_id

Integer row ID of the snapshot written to SQLite, or NULL if the write failed.

Examples


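# Replace backslashes with forward slashes so the path is safe inside YAML strings.
tmp <- gsub("\\\\", "/", tempdir())
dat <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
# Global configuration: snapshot database, report directory and default rules.
writeLines(c(
  paste0('snapshot_db: "',       tmp, '/snap.sqlite"'),
  paste0('report_output_dir: "', tmp, '"'),
  'default_rules:',
  '  max_missing_rate: 0.60',
  '  min_row_count: 80'
), file.path(tmp, "dqcheckr.yml"))
# Dataset configuration: file location and format details.
writeLines(c(
  'dataset_name: "starwars_csv"',
  paste0('current_file: "', dat, '"'),
  'format: csv',
  'encoding: "UTF-8"',
  'delimiter: ","'
), file.path(tmp, "starwars_csv.yml"))
# Run the full pipeline without opening the report in a browser.
result <- run_dq_check("starwars_csv", config_dir = tmp, open_report = FALSE)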
result$status
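
In a scheduled job the returned status could be turned into a process exit code. The mapping below is a hypothetical convention, not part of the package, and the list stands in for the value returned by run_dq_check.

```r
# Hypothetical mapping from overall status to an exit code.
exit_code_for <- function(status) {
  switch(status, FAIL = 2L, WARN = 1L, 0L)  # PASS and INFO map to 0
}

result <- list(status = "WARN")  # stand-in for the run_dq_check() return value
code   <- exit_code_for(result$status)
code
# A non-interactive Rscript job might then end with quit(status = code).
```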



Run all generic quality checks on a dataset

Description

Runs the full QC check suite (QC-01 to QC-14, SC-01, SC-02) against a single data frame snapshot.

Usage

run_qc_checks(df, config)

Arguments

df

A data frame with all columns as character vectors (as returned by read_dataset).

config

Named list. Merged configuration as returned by load_config.

Value

A list of dq_result objects.

Examples

cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg     <- load_config("starwars_csv", config_dir = cfg_dir)
path    <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df      <- read_dataset(path, cfg)
results <- run_qc_checks(df, cfg)


Write a run snapshot to the SQLite database

Description

Write a run snapshot to the SQLite database

Usage

write_snapshot(
  db_path,
  dataset_name,
  file_name,
  df,
  qc_results,
  cp_results,
  custom_results,
  config,
  col_stats = NULL
)
