Quick Start

Gilles Colling

2026-02-03

The Problem: Silent Data Corruption

You receive monthly customer exports from a CRM system. The data should have unique customer_id values and complete email addresses. One month, someone upstream changes the export logic. Now customer_id has duplicates and some emails are missing.

Without explicit checks, you won’t notice until something breaks downstream—wrong row counts after a join, duplicated invoices, failed email campaigns.

# January export: clean data
january <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  email = c("alice@example.com", "bob@example.com", "carol@example.com",
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "premium", "basic", "premium")
)

# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
  customer_id = c(101, 102, 102, 104, 105),  # Note: 102 is duplicated
  email = c("alice@example.com", "bob@example.com", NA,
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "basic", "basic", "premium")
)

The February data looks fine at a glance:

head(february)
#>   customer_id             email segment
#> 1         101 alice@example.com premium
#> 2         102   bob@example.com   basic
#> 3         102              <NA>   basic
#> 4         104  dave@example.com   basic
#> 5         105   eve@example.com premium
nrow(february)  # Same row count
#> [1] 5

But it will silently corrupt your analysis.
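
To see how, here is a hedged sketch: suppose you join a hypothetical invoices table against the February export. The duplicated customer_id quietly inflates the result, and nothing warns you.

# A hypothetical invoices table; joining it against the corrupted export
# silently duplicates customer 102.
invoices <- data.frame(
  customer_id = c(101, 102, 104),
  amount = c(120, 80, 45)
)

merge(invoices, february, by = "customer_id")  # 4 rows, not 3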


The Solution: Make Assumptions Explicit

keyed catches these issues by making your assumptions explicit:

library(keyed)
library(dplyr)   # used for filter(), mutate(), and joins later on

# Define what you expect: customer_id is unique
january_keyed <- january |>
  key(customer_id) |>
  lock_no_na(email)

# This works - January data is clean
january_keyed
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         103 carol@example.com premium
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

Now try the same with February’s corrupted data:

# keyed flags the problem immediately - duplicates detected
february |>
  key(customer_id)
#> Warning: Key is not unique.
#> ℹ 1 duplicate key value(s) found.
#> ℹ Key columns: customer_id
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         102 <NA>              basic  
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

The check catches the problem at import time, not downstream when you're debugging a mysterious row count mismatch.
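
As shown above, key() warns and still returns the data, which is convenient interactively. In an unattended pipeline you may prefer a hard stop; a minimal base-R sketch (import_strict() is a hypothetical wrapper, and it assumes the message above is signaled as an ordinary R warning):

# Hypothetical wrapper: promote keyed's warning to an error so a corrupted
# export stops the script outright.
import_strict <- function(df) {
  withCallingHandlers(
    df |> key(customer_id),
    warning = function(w) stop(conditionMessage(w), call. = FALSE)
  )
}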


Workflow 1: Monthly Data Validation

Goal: Validate each month’s export against expected constraints before processing.

Challenge: Data quality varies month-to-month. Silent corruption causes cascading errors.

Strategy: Define keys and assumptions once, apply consistently to each import.

Define validation function

validate_customer_export <- function(df) {
  df |>
    key(customer_id) |>
    lock_no_na(email) |>
    lock_nrow(min = 1)
}

# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
#> 
#> ── Keyed Data Frame Summary
#> Dimensions: 5 rows x 3 columns
#> 
#> Key columns: customer_id
#> ✔ Key is unique
#> 
#> Row IDs: none
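
With the validation function in place, every monthly export can pass through the same gate before processing. A sketch (the file names are hypothetical placeholders):

# Placeholder file names; apply the same checks to every export
files <- c("export_2026_01.csv", "export_2026_02.csv")

monthly <- lapply(files, function(path) {
  read.csv(path) |>
    validate_customer_export()
})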

Keys survive transformations

Once defined, keys persist through dplyr operations:

# Filter preserves key
premium_customers <- january_clean |>
  filter(segment == "premium")

has_key(premium_customers)
#> [1] TRUE
get_key_cols(premium_customers)
#> [1] "customer_id"

# Mutate preserves key
enriched <- january_clean |>
  mutate(domain = sub(".*@", "", email))

has_key(enriched)
#> [1] TRUE

Strict enforcement

If an operation breaks uniqueness, keyed errors and tells you to use unkey() first:

# This creates duplicates - keyed stops you
january_clean |>
  mutate(customer_id = 1)
#> Error in `mutate()`:
#> ! Key is no longer unique after transformation.
#> ℹ Use `unkey()` first if you intend to break uniqueness.

To proceed, you must explicitly acknowledge breaking the key:

january_clean |>
  unkey() |>
  mutate(customer_id = 1)
#> # A tibble: 5 × 3
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1           1 alice@example.com premium
#> 2           1 bob@example.com   basic  
#> 3           1 carol@example.com premium
#> 4           1 dave@example.com  basic  
#> 5           1 eve@example.com   premium
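
If a later step restores uniqueness, you can re-apply key() to pick the guarantee back up. A sketch, using dplyr::row_number() purely as an illustrative replacement ID:

# Illustrative only: row_number() stands in for whatever restores uniqueness
january_clean |>
  unkey() |>
  mutate(customer_id = row_number()) |>
  key(customer_id)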

Workflow 2: Safe Joins

Goal: Join customer data with orders without accidentally duplicating rows.

Challenge: Join cardinality mistakes are common and hard to debug. A “one-to-one” join that’s actually one-to-many silently inflates your data.

Strategy: Use diagnose_join() to understand cardinality before joining.

Create sample data

customers <- data.frame(
  customer_id = 1:5,
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
  key(customer_id)

orders <- data.frame(
  order_id = 1:8,
  customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
  key(order_id)

Diagnose before joining

diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
#> 
#> ── Join Diagnosis
#> Cardinality: one-to-many
#> x: 5 rows, unique
#> y: 8 rows, 3 duplicates

The diagnosis shows a one-to-many relationship: customer_id is unique in customers but appears multiple times in orders.

Now you know what to expect. A left_join() will create 8 rows (one per order), not 5 (one per customer).
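
You can confirm that expectation before committing to the join. A sketch; unkey() is used here so the check does not depend on how keyed tibbles behave inside dplyr joins:

# unkey() first so the join behaves like a plain dplyr join
joined <- customers |>
  unkey() |>
  left_join(orders |> unkey(), by = "customer_id")

nrow(joined)  # 8: one row per order, not one per customer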

Compare key structures

compare_keys(customers, orders)
#> 
#> ── Key Comparison
#> Comparing on: customer_id
#> 
#> x: 5 unique keys
#> y: 5 unique keys
#> 
#> Common: 5 (100.0% of x)
#> Only in x: 0
#> Only in y: 0

This shows the join key exists in both tables but with different uniqueness properties—essential information before joining.


Workflow 3: Row Identity Tracking

Goal: Track which original rows survive through a complex pipeline.

Challenge: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.

Strategy: Use add_id() to attach stable identifiers that survive transformations.

Add row IDs

# Add UUIDs to rows
customers_tracked <- customers |>
  add_id()

customers_tracked
#> # A keyed tibble: 5 x 4
#> # Key:            customer_id | .id
#>   .id                                  customer_id name  tier  
#>   <chr>                                      <int> <chr> <chr> 
#> 1 e87304fc-09ed-4634-8caa-a9d9cf2352cc           1 Alice gold  
#> 2 d4c8b392-666d-43e1-8178-ce8e01efd218           2 Bob   silver
#> 3 149ca4bd-d304-46a8-822b-fe344600d006           3 Carol gold  
#> 4 d6031d5d-90eb-44f7-96a8-db4a0ef6da72           4 Dave  bronze
#> 5 2aad2788-779c-48fb-86de-8dfd71222a2c           5 Eve   silver

IDs survive transformations

# Filter: IDs persist
gold_customers <- customers_tracked |>
  filter(tier == "gold")

get_id(gold_customers)
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"

# Compare with original
compare_ids(customers_tracked, gold_customers)
#> $lost
#> [1] "d4c8b392-666d-43e1-8178-ce8e01efd218"
#> [2] "d6031d5d-90eb-44f7-96a8-db4a0ef6da72"
#> [3] "2aad2788-779c-48fb-86de-8dfd71222a2c"
#> 
#> $gained
#> character(0)
#> 
#> $preserved
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"

The comparison shows exactly which rows were lost (filtered out) and which were preserved.
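
Because the IDs are ordinary column values, the comparison can be used to pull back the rows that were dropped:

# Look up the filtered-out rows by their lost IDs
lost_ids <- compare_ids(customers_tracked, gold_customers)$lost

customers_tracked |>
  filter(.id %in% lost_ids)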

Combining data with ID handling

When appending new data, bind_id() handles ID conflicts:

batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6)  # No IDs yet

# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
#>                                    .id x
#> 1 beb82b30-6d2b-4a1a-a952-9b710fcf7f62 1
#> 2 766c5fb1-2c63-4b61-9ce3-c46a80e92cfa 2
#> 3 7aac4ff0-a2f9-4965-abf1-7d1501bbf0b6 3
#> 4 c42f55c3-c5d8-4f76-a95e-9303876450fd 4
#> 5 a4ab668b-4e54-4dc1-bdc2-21686e4944d6 5
#> 6 30995197-d6a4-4938-8904-92358c8c7088 6

Workflow 4: Drift Detection

Goal: Detect when data changes unexpectedly between pipeline runs.

Challenge: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.

Strategy: Commit snapshots with commit_keyed() and check for drift with check_drift().

Commit a reference snapshot

# Commit current state as reference
reference_data <- data.frame(
  region_id = c("US", "EU", "APAC"),
  tax_rate = c(0.08, 0.20, 0.10)
) |>
  key(region_id) |>
  commit_keyed()
#> ✔ Snapshot committed: 76a76466...

Check for drift

# No changes yet
check_drift(reference_data)
#> 
#> ── Drift Report
#> ✔ No drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)

Detect changes

# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21

# Drift detected!
check_drift(modified_data)
#> 
#> ── Drift Report
#> ! Drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)
#> ℹ Key values changed
#> ℹ Cell values modified

The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
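
If the change turns out to be legitimate, one option is to record the new state as the reference going forward; a sketch, assuming a fresh commit_keyed() call replaces the snapshot consulted by check_drift():

# Assumption: committing again records modified_data as the new reference
reference_data <- modified_data |>
  commit_keyed()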

Cleanup

# Remove snapshots when done
clear_all_snapshots()
#> ! This will remove 1 snapshot(s) from cache.
#> ✔ Cleared 1 snapshot(s).

Quick Reference

Core Functions

Function                     Purpose
key()                        Define key columns (validates uniqueness)
unkey()                      Remove key
has_key(), get_key_cols()    Query key status

Assumption Checks

Function           Validates
lock_unique()      No duplicate values
lock_no_na()       No missing values
lock_complete()    All expected values present
lock_coverage()    Reference values covered
lock_nrow()        Row count within bounds

Diagnostics

Function             Purpose
diagnose_join()      Analyze join cardinality
compare_keys()       Compare key structures
compare_ids()        Compare row identities
find_duplicates()    Find duplicate key values
key_status()         Quick status summary

Row Identity

Function      Purpose
add_id()      Add UUIDs to rows
get_id()      Retrieve row IDs
bind_id()     Combine data with ID handling
make_id()     Create deterministic IDs from columns
check_id()    Validate ID integrity

Drift Detection

Function            Purpose
commit_keyed()      Save reference snapshot
check_drift()       Compare against snapshot
list_snapshots()    View saved snapshots
clear_snapshot()    Remove specific snapshot

When to Use Something Else

keyed is designed for flat-file workflows without database infrastructure. If you need:

Need                    Better Alternative
Enforced schema         Database (SQLite, DuckDB)
Version history         Git, git2r
Full data validation    pointblank, validate
Production pipelines    targets

keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.

