You receive monthly customer exports from a CRM system. The data
should have unique customer_id values and complete
email addresses. One month, someone upstream changes the
export logic. Now customer_id has duplicates and some
emails are missing.
Without explicit checks, you won’t notice until something breaks downstream—wrong row counts after a join, duplicated invoices, failed email campaigns.
# January export: clean data
january <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  email = c("alice@example.com", "bob@example.com", "carol@example.com",
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "premium", "basic", "premium")
)
# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
  customer_id = c(101, 102, 102, 104, 105), # Note: 102 is duplicated
  email = c("alice@example.com", "bob@example.com", NA,
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "basic", "basic", "premium")
)

The February data looks fine at a glance:
head(february)
#> customer_id email segment
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 102 <NA> basic
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium
nrow(february) # Same row count
#> [1] 5

But it will silently corrupt your analysis.
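To make that concrete, here is a minimal sketch of the downstream damage (the invoices table and its amounts are invented for illustration): because customer 102 appears twice, a join on customer_id quietly produces an extra row and double-counts that customer's invoice.

# Hypothetical invoices table: one invoice per customer (amounts invented)
invoices <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  amount = c(50, 80, 120, 60, 95)
)

# Customer 102 matches two rows in the February export,
# so the join returns 6 rows instead of the expected 5
merged <- merge(invoices, february, by = "customer_id", all.x = TRUE)
nrow(merged)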
keyed catches these issues by making your assumptions explicit:
library(keyed)

# Define what you expect: customer_id is unique and email is never missing
january_keyed <- january |>
  key(customer_id) |>
  lock_no_na(email)
# This works - January data is clean
january_keyed
#> # A keyed tibble: 5 x 3
#> # Key: customer_id
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 103 carol@example.com premium
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium

Now try the same with February’s corrupted data:
# This is flagged immediately - duplicates detected
february |>
  key(customer_id)
#> Warning: Key is not unique.
#> ℹ 1 duplicate key value(s) found.
#> ℹ Key columns: customer_id
#> # A keyed tibble: 5 x 3
#> # Key: customer_id
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 102 <NA> basic
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium

The warning catches the problem at import time, not downstream when you’re debugging a mysterious row count mismatch.
Goal: Validate each month’s export against expected constraints before processing.
Challenge: Data quality varies month-to-month. Silent corruption causes cascading errors.
Strategy: Define keys and assumptions once, apply consistently to each import.
validate_customer_export <- function(df) {
  df |>
    key(customer_id) |>
    lock_no_na(email) |>
    lock_nrow(min = 1)
}
# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
#>
#> ── Keyed Data Frame Summary
#> Dimensions: 5 rows x 3 columns
#>
#> Key columns: customer_id
#> ✔ Key is unique
#>
#> Row IDs: none

Once defined, keys persist through dplyr operations:
library(dplyr)

# Filter preserves key
premium_customers <- january_clean |>
  filter(segment == "premium")
has_key(premium_customers)
#> [1] TRUE
get_key_cols(premium_customers)
#> [1] "customer_id"
# Mutate preserves key
enriched <- january_clean |>
  mutate(domain = sub(".*@", "", email))
has_key(enriched)
#> [1] TRUE

If an operation breaks uniqueness, keyed errors and tells you to use unkey() first:
# This creates duplicates - keyed stops you
january_clean |>
  mutate(customer_id = 1)
#> Error in `mutate()`:
#> ! Key is no longer unique after transformation.
#> ℹ Use `unkey()` first if you intend to break uniqueness.

To proceed, you must explicitly acknowledge breaking the key:
january_clean |>
  unkey() |>
  mutate(customer_id = 1)
#> # A tibble: 5 × 3
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 1 alice@example.com premium
#> 2 1 bob@example.com basic
#> 3 1 carol@example.com premium
#> 4 1 dave@example.com basic
#> 5 1 eve@example.com premium

Goal: Join customer data with orders without accidentally duplicating rows.
Challenge: Join cardinality mistakes are common and hard to debug. A “one-to-one” join that’s actually one-to-many silently inflates your data.
Strategy: Use diagnose_join() to
understand cardinality before joining.
customers <- data.frame(
  customer_id = 1:5,
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
  key(customer_id)

orders <- data.frame(
  order_id = 1:8,
  customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
  key(order_id)

diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
#>
#> ── Join Diagnosis
#> Cardinality: one-to-many
#> x: 5 rows, unique
#> y: 8 rows, 3 duplicates

The diagnosis shows:

- Cardinality is one-to-many: Each customer can have multiple orders
- Coverage: Shows how many keys match vs. don’t match

Now you know what to expect. A left_join() will create 8 rows (one per order), not 5 (one per customer).
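To see the inflation directly, here is a minimal sketch (assuming dplyr is attached; unkey() is used first in case keyed objects to customer_id no longer being unique after the join):

# One row per matching order: 8 rows come back, not 5
customer_orders <- customers |>
  unkey() |>
  left_join(unkey(orders), by = "customer_id")

nrow(customer_orders)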
compare_keys(customers, orders)
#>
#> ── Key Comparison
#> Comparing on: customer_id
#>
#> x: 5 unique keys
#> y: 5 unique keys
#>
#> Common: 5 (100.0% of x)
#> Only in x: 0
#> Only in y: 0

This shows the join key exists in both tables but with different uniqueness properties—essential information before joining.
Goal: Track which original rows survive through a complex pipeline.
Challenge: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.
Strategy: Use add_id() to attach stable
identifiers that survive transformations.
# Add UUIDs to rows
customers_tracked <- customers |>
  add_id()
customers_tracked
#> # A keyed tibble: 5 x 4
#> # Key: customer_id | .id
#> .id customer_id name tier
#> <chr> <int> <chr> <chr>
#> 1 e87304fc-09ed-4634-8caa-a9d9cf2352cc 1 Alice gold
#> 2 d4c8b392-666d-43e1-8178-ce8e01efd218 2 Bob silver
#> 3 149ca4bd-d304-46a8-822b-fe344600d006 3 Carol gold
#> 4 d6031d5d-90eb-44f7-96a8-db4a0ef6da72 4 Dave bronze
#> 5 2aad2788-779c-48fb-86de-8dfd71222a2c 5 Eve silver

# Filter: IDs persist
gold_customers <- customers_tracked |>
  filter(tier == "gold")
get_id(gold_customers)
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"
# Compare with original
compare_ids(customers_tracked, gold_customers)
#> $lost
#> [1] "d4c8b392-666d-43e1-8178-ce8e01efd218"
#> [2] "d6031d5d-90eb-44f7-96a8-db4a0ef6da72"
#> [3] "2aad2788-779c-48fb-86de-8dfd71222a2c"
#>
#> $gained
#> character(0)
#>
#> $preserved
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"The comparison shows exactly which rows were lost (filtered out) and which were preserved.
When appending new data, bind_id() handles ID
conflicts:
batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6) # No IDs yet
# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
#> .id x
#> 1 beb82b30-6d2b-4a1a-a952-9b710fcf7f62 1
#> 2 766c5fb1-2c63-4b61-9ce3-c46a80e92cfa 2
#> 3 7aac4ff0-a2f9-4965-abf1-7d1501bbf0b6 3
#> 4 c42f55c3-c5d8-4f76-a95e-9303876450fd 4
#> 5 a4ab668b-4e54-4dc1-bdc2-21686e4944d6 5
#> 6 30995197-d6a4-4938-8904-92358c8c7088 6

Goal: Detect when data changes unexpectedly between pipeline runs.
Challenge: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.
Strategy: Commit snapshots with
commit_keyed() and check for drift with
check_drift().
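The drift check below compares against a snapshot committed on an earlier run. A minimal sketch of that setup follows; the reference_data table, its columns, and the exact commit_keyed() call are illustrative assumptions rather than documented usage.

# Illustrative reference table: tax rates by region (values invented)
reference_data <- data.frame(
  region = c("US", "EU", "UK"),
  tax_rate = c(0.07, 0.20, 0.19)
) |>
  key(region)

# Save a reference snapshot to compare against on later runs
commit_keyed(reference_data)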
# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21
# Drift detected!
check_drift(modified_data)
#>
#> ── Drift Report
#> ! Drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)
#> ℹ Key values changed
#> ℹ Cell values modified

The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
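If the change turns out to be intentional, one option is to accept it by committing the modified data as the new reference, under the same assumption about commit_keyed() as above:

# Accept the upstream change: make the modified data the new snapshot
commit_keyed(modified_data)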
| Function | Purpose |
|---|---|
| `key()` | Define key columns (validates uniqueness) |
| `unkey()` | Remove key |
| `has_key()`, `get_key_cols()` | Query key status |
| Function | Validates |
|---|---|
| `lock_unique()` | No duplicate values |
| `lock_no_na()` | No missing values |
| `lock_complete()` | All expected values present |
| `lock_coverage()` | Reference values covered |
| `lock_nrow()` | Row count within bounds |
| Function | Purpose |
|---|---|
| `diagnose_join()` | Analyze join cardinality |
| `compare_keys()` | Compare key structures |
| `compare_ids()` | Compare row identities |
| `find_duplicates()` | Find duplicate key values |
| `key_status()` | Quick status summary |
| Function | Purpose |
|---|---|
| `add_id()` | Add UUID to rows |
| `get_id()` | Retrieve row IDs |
| `bind_id()` | Combine data with ID handling |
| `make_id()` | Create deterministic IDs from columns |
| `check_id()` | Validate ID integrity |
| Function | Purpose |
|---|---|
| `commit_keyed()` | Save reference snapshot |
| `check_drift()` | Compare against snapshot |
| `list_snapshots()` | View saved snapshots |
| `clear_snapshot()` | Remove specific snapshot |
keyed is designed for flat-file workflows without database infrastructure. If you need:
| Need | Better Alternative |
|---|---|
| Enforced schema | Database (SQLite, DuckDB) |
| Version history | Git, git2r |
| Full data validation | pointblank, validate |
| Production pipelines | targets |
keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.
Design Philosophy - The reasoning behind keyed’s approach
Function Reference - Complete API documentation