The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Diagnostic Functions Guide
library(tidyaudit)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tidyaudit includes tidyverse ports of the diagnostic functions from
dtaudit. These
functions help you understand joins, validate keys, compare tables,
diagnose missing values and string quality, and filter with full
visibility.
Join diagnostics
validate_join() analyzes a potential join
without performing it, reporting match rates,
relationship type, duplicate keys, and unmatched rows.
orders <- data.frame(
id = c(1L, 2L, 3L, 3L, 4L, 5L),
amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
id = c(2L, 3L, 6L),
name = c("Alice", "Bob", "Carol")
)
validate_join(orders, customers, by = "id")
#>
#> ── Join Validation: orders ↔ customers ─────────────────────────────────────────
#> Keys in orders: id
#> Keys in customers: id
#>
#> Item Value
#> ─────────────────────────────────────────── ───────────
#> Relationship many-to-one
#> Key(s) in orders [id] (1 col)
#> Key(s) in customers [id] (1 col)
#> Rows in orders 6
#> Distinct key combos in orders 5
#> Rows in customers 3
#> Distinct key combos in customers 3
#> Overlapping distinct key combos 2
#> Matched row pairs (cartesian) 3
#> Match rate from orders 50.00%
#> Match rate from customers 66.67%
#> Rows only in orders (no match in customers) 3
#> Rows only in customers (no match in orders) 1
#>
#> Duplicates: orders=yes customers=no
Different key names
When the key columns have different names, use a named vector:
products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))
validate_join(products, sales, by = c("prod_id" = "item_id"))
#>
#> ── Join Validation: products ↔ sales ───────────────────────────────────────────
#> Keys in products: prod_id
#> Keys in sales: item_id
#>
#> Item Value
#> ───────────────────────────────────────── ───────────
#> Relationship one-to-many
#> Key(s) in products [prod_id] (1 col)
#> Key(s) in sales [item_id] (1 col)
#> Rows in products 3
#> Distinct key combos in products 3
#> Rows in sales 3
#> Distinct key combos in sales 2
#> Overlapping distinct key combos 2
#> Matched row pairs (cartesian) 3
#> Match rate from products 66.67%
#> Match rate from sales 100.00%
#> Rows only in products (no match in sales) 1
#> Rows only in sales (no match in products) 0
#>
#> Duplicates: products=no sales=yes
Stat tracking
Track the impact on a numeric column with stat (same
column name in both tables) or stat_x/stat_y
(different column names):
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))
validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
#>
#> ── Join Validation: x ↔ y ──────────────────────────────────────────────────────
#> Keys in x: id
#> Keys in y: id
#>
#> Item Value
#> ─────────────────────────────── ──────────
#> Relationship one-to-one
#> Key(s) in x [id] (1 col)
#> Key(s) in y [id] (1 col)
#> Rows in x 4
#> Distinct key combos in x 4
#> Rows in y 3
#> Distinct key combos in y 3
#> Overlapping distinct key combos 2
#> Matched row pairs (cartesian) 2
#> Match rate from x 50.00%
#> Match rate from y 66.67%
#> Rows only in x (no match in y) 2
#> Rows only in y (no match in x) 1
#>
#> ── Stat diagnostics ────────────────────────────────────────────────────────────
#>
#> revenue in x:
#> • Total: 1,000
#> • Matched: 500 (50.00%)
#> • Unmatched: 500 (50.00%)
#>
#> cost in y:
#> • Total: 60
#> • Matched: 30 (50.00%)
#> • Unmatched: 30 (50.00%)
#>
#> Duplicates: x=no y=no
Key validation
Primary keys
validate_primary_keys() tests whether a set of columns
uniquely identify every row:
df <- data.frame(
id = c(1L, 2L, 3L, 3L, 4L),
group = c("A", "A", "B", "C", "A"),
value = c(10, 20, 30, 40, 50)
)
# Single column — not unique
validate_primary_keys(df, "id")
#>
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key column: id
#>
#> Metric Value
#> ─────────────────────── ─────
#> Total rows 5
#> Unique key combinations 4
#> Duplicate key combos 1
#>
#> ✖ NO - Keys do NOT uniquely identify all rows.
#>
#> Duplicate keys (showing up to 10):
#> id n
#> 1 3 2
# Composite key — unique
validate_primary_keys(df, c("id", "group"))
#>
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key columns: id and group
#>
#> Metric Value
#> ─────────────────────── ─────
#> Total rows 5
#> Unique key combinations 5
#> Duplicate key combos 0
#>
#> ✔ YES - Keys uniquely identify all rows.
Variable relationships
validate_var_relationship() determines the relationship
between two columns:
df2 <- data.frame(
dept = c("Sales", "Sales", "Engineering", "Engineering"),
manager = c("Ann", "Ann", "Bob", "Bob")
)
validate_var_relationship(df2, "dept", "manager")
#>
#> ── Variable Relationship Validation ────────────────────────────────────────────
#> Table: df2
#> Variables: dept ↔ manager
#>
#> Metric Value
#> ──────────────────────────── ─────
#> Unique values in dept 2
#> Unique values in manager 2
#> Unique (dept, manager) pairs 2
#>
#> dept → manager: one-to-one
#> manager → dept: one-to-one
#>
#> Relationship: ONE-TO-ONE
Table comparison
compare_tables() compares two data.frames by examining
columns, row counts, key overlap, and numeric discrepancies:
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))
compare_tables(before, after)
#>
#> ── Table Comparison: before vs after ───────────────────────────────────────────
#> 1. Row counts
#> before: 5 rows
#> after: 5 rows
#> Difference: =
#>
#> 2. Column names
#> Matching columns: 2
#> Only in before: 0
#> Only in after: 0
#> Type mismatches: 0
#>
#> 3. Key columns
#> Key columns: id (auto-detected)
#> Distinct combos in before: 5
#> Distinct combos in after: 5
#> Matching combos: 5
#> Only in before: 0
#> Only in after: 0
#>
#> 4. Numeric discrepancies (absolute differences)
#> Comparing after merging on keys.
#> Rows matched: 5
#>
#> Column N Min Q25 Median Q75 Max
#> ────── ─ ─── ─── ────── ─── ───
#> value 5 0 0 0 2.5 5
Filter diagnostics
filter_keep() and filter_drop() filter data
while printing diagnostics about what was removed.
filter_keep
Keeps rows where the condition is TRUE (same as
dplyr::filter()):
sales <- data.frame(
id = 1:10,
amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
status = rep(c("valid", "suspect"), 5)
)
result <- filter_keep(sales, amount > 100, .stat = amount)
#> filter_keep(sales, amount > 100)
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).
filter_drop
Drops rows where the condition is TRUE (the
inverse):
result2 <- filter_drop(sales, status == "suspect", .stat = amount)
#> filter_drop(sales, status == "suspect")
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).
Warning thresholds
Set .warn_threshold to get a warning when too many rows
are dropped:
filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
#> filter_keep(sales, amount > 1000)
#> Dropped 7 of 10 rows (70.00%).
#> Dropped 1,470 of 7,670 for amount (19.17%).
#> Warning: Dropped 70.0% of rows exceeds threshold (50.0%).
#> id amount status
#> 1 3 1200 valid
#> 2 5 3000 valid
#> 3 9 2000 valid
Data quality
Missing value diagnosis
diagnose_nas() reports NA counts and percentages for
every column:
messy <- data.frame(
id = 1:6,
name = c("A", NA, "C", "D", NA, "F"),
score = c(10, 20, NA, NA, 50, NA),
grade = c("A", "B", "C", NA, "A", "B")
)
diagnose_nas(messy)
#>
#> ── Missing Value Diagnosis ─────────────────────────────────────────────────────
#> 3 of 4 columns have missing values
#>
#> Variable N NA Pct NA
#> ──────── ──── ──────
#> score 3 50.0%
#> name 2 33.3%
#> grade 1 16.7%
Column summaries
summarize_column() gives type-appropriate statistics for
a single vector:
summarize_column(c(1, 2, 3, NA, 5, 10, 100))
#> type n_unique missing most_frequent
#> "numeric" "6" "1" NA
#> mean sd min q25
#> "20.1666666666667" "39.2398606861271" "1" "2.25"
#> q50 q75 max example1
#> "4" "8.75" "100" "1"
#> example2 example3
#> "2" "3"
summarize_column(c("apple", "banana", "apple", "cherry", NA))
#> type n_unique missing most_frequent mean
#> "character" "3" "1" "apple" NA
#> sd min q25 q50 q75
#> NA "apple" NA NA NA
#> max example1 example2 example3
#> "cherry" "apple" "banana" "apple"
get_summary_table() applies this to all columns (or
selected ones):
get_summary_table(messy)
#> variable type n_unique missing most_frequent mean
#> 1 id numeric 6 0 <NA> 3.5
#> 2 name character 4 2 A <NA>
#> 3 score numeric 3 3 <NA> 26.6666666666667
#> 4 grade character 3 1 A <NA>
#> sd min q25 q50 q75 max example1 example2 example3
#> 1 1.87082869338697 1 2.25 3.5 4.75 6 1 2 3
#> 2 <NA> A <NA> <NA> <NA> F A C D
#> 3 20.8166599946613 10 15 20 35 50 10 20 50
#> 4 <NA> A <NA> <NA> <NA> C A B C
String cleaning
These functions require the stringi package (listed
in Suggests).
diagnose_strings
diagnose_strings() audits a character vector for common
quality issues:
firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")
diagnose_strings(firms)
#>
#> ── String Column Diagnosis: firms ──────────────────────────────────────────────
#> Total elements: 7
#>
#> Missing & Empty:
#> • NA values: 1 (14.3%)
#> • Empty strings: 1 (14.3%)
#> • Whitespace-only: 0 (0.0%)
#>
#> Whitespace Issues:
#> • Leading whitespace: 1
#> • Trailing whitespace: 1
#>
#> Encoding:
#> • Non-ASCII chars: 0
#>
#> Case Inconsistencies:
#> • Variant groups: 1
#> • Total variants: 3
#>
#> Case variant examples (up to 5 groups):
#> lower n_variants examples
#> apple 3 Apple, APPLE, apple
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.