Before running any phylogenetic comparative analysis (PGLS, phylogenetic mixed models, ancestral state reconstruction), species names in your data must match the tip labels in your tree. In practice, they rarely do. prepR4pcm automates the matching of species names between data and tree (which we call reconciliation), records every name-matching decision so you can audit it later, and produces an aligned data frame and a pruned tree (the aligned objects) whose species lists match exactly, which is the precondition for any phylogenetic comparative method.
Mismatches between data and tree arise from three kinds of difference:
- The same species written as Homo_sapiens in the tree and as Homo sapiens in the data; trailing whitespace and attached authority strings (Homo sapiens Linnaeus, 1758) cause similar mismatches.
Fixing these by hand is tedious, error-prone, and poorly documented.
prepR4pcm solves this with a structured matching
cascade of algorithms: exact match → normalised match → synonym
resolution. Every decision is recorded by the software in the
reconciliation result, where you can inspect it via
reconcile_mapping() or
reconcile_summary().
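To make the first two stages of the cascade concrete, here is a minimal base-R sketch of exact-then-normalised matching. The helper `normalise_name()` is hypothetical and written only for illustration; it is not prepR4pcm's internal code, and the package's actual normalisation rules may differ (for instance, this sketch keeps only the first two words, so it would truncate subspecies names).

```r
# Hypothetical helper for illustration only -- NOT part of prepR4pcm.
normalise_name <- function(x) {
  x <- gsub("_", " ", x, fixed = TRUE)    # underscores -> spaces
  x <- trimws(x)                          # drop leading/trailing whitespace
  x <- gsub("\\s+", " ", x, perl = TRUE)  # collapse repeated whitespace
  # Keep genus + epithet only, which drops a trailing authority string
  x <- sub("^(\\S+ \\S+).*$", "\\1", x, perl = TRUE)
  tolower(x)                              # ignore case differences
}

data_names <- c("Homo sapiens Linnaeus, 1758", "Pan_troglodytes", "Cebus capucinus ")
tree_tips  <- c("Homo_sapiens", "Pan_troglodytes", "Papio_anubis")

# Stage 1: exact string equality
is_exact <- data_names %in% tree_tips

# Stage 2: equality after normalising both sides
norm_hit <- match(normalise_name(data_names), normalise_name(tree_tips))

match_type <- ifelse(is_exact, "exact",
              ifelse(!is.na(norm_hit), "normalized", "unresolved"))

# One row per data name, with the tree tip it was paired with (NA if none)
data.frame(name_x = data_names, name_y = tree_tips[norm_hit], match_type)
```

Running the exact stage first means a name that already matches verbatim is never relabelled as a normalised match; only names that fail stage 1 fall through to stage 2.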
Suppose you have trait data and a phylogenetic tree with slightly different naming conventions.
# Simulated trait data for 6 primate species
trait_data <- data.frame(
species = c(
"Homo sapiens",
"Pan_troglodytes", # underscore instead of space
"Gorilla gorilla",
"Pongo pygmaeus",
"Macaca mulatta",
"Cebus capucinus"
),
body_mass = c(70, 50, 160, 80, 8, 3),
brain_mass = c(1.35, 0.39, 0.50, 0.37, 0.11, 0.07)
)
# Simulated phylogenetic tree (built manually for this example)
tree <- ape::read.tree(text = paste0(
"((((Homo_sapiens:5,Pan_troglodytes:5):3,",
"Gorilla_gorilla:8):4,Pongo_pygmaeus:12):6,",
"(Macaca_mulatta:10,Papio_anubis:10):8);"
))
tree$tip.label # the tip labels (species names) on the tree
#> [1] "Homo_sapiens" "Pan_troglodytes" "Gorilla_gorilla" "Pongo_pygmaeus"
#> [5] "Macaca_mulatta" "Papio_anubis"
plot(tree) # quick visual; underscores in tip labels render as spaces
(ape::plot.phylo() displays underscores as spaces by default; the underlying tree$tip.label strings still contain underscores, which is why printing tree$tip.label shows them.)
Notice the mismatches:
- Pan_troglodytes in the data has an underscore; the tree uses underscores throughout, but the data column mixes spaces and underscores.
- Cebus capucinus is in the data but not in the tree.
- Papio anubis is in the tree but not in the data.
result <- reconcile_tree(
x = trait_data,
tree = tree,
x_species = "species",
authority = NULL, # skip synonym lookup for this example
quiet = FALSE
)
#> ℹ Reconciling 6 data names vs 6 tree tips
#> ✔ Matched 5/6 data names to tree tips
print(result)
#>
#> ── Reconciliation: data vs tree ────────────────────────────────────────────────
#> Source x: trait_data
#> Source y: phylo (6 tips)
#> Authority: none
#> Timestamp: 2026-06-20 12:44:55
#> ℹ Match coverage: [█████████████████████████░░░░░] 83% (5/6)
#>
#> ── Match summary ──
#>
#> • Exact: 1 (16.7%)
#> • Normalized: 4 (66.7%)
#> • Synonym: 0 ( 0.0%)
#> • Fuzzy: 0 ( 0.0%)
#> • Manual: 0 ( 0.0%)
#> ! Unresolved (x only):1 (16.7%)
#> ! Unresolved (y only):1
#> ! Flagged for review: 0
#> ℹ Use `reconcile_summary()` for details, `reconcile_mapping()` for the full table.
The “Reconciliation: data vs tree” header at the top of the output tells you the call that produced the result; the “Match summary” block underneath gives the count in each match category (exact, normalised, synonym, fuzzy, manual, unresolved). Use reconcile_mapping() to see the full per-name table:
reconcile_mapping(result)
#> # A tibble: 7 × 9
#> name_x name_y name_resolved match_type match_score match_source in_x in_y
#> <chr> <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl>
#> 1 Pan_trog… Pan_t… <NA> exact 1 exact_string TRUE TRUE
#> 2 Homo sap… Homo_… <NA> normalized 1 normalisati… TRUE TRUE
#> 3 Gorilla … Goril… <NA> normalized 1 normalisati… TRUE TRUE
#> 4 Pongo py… Pongo… <NA> normalized 1 normalisati… TRUE TRUE
#> 5 Macaca m… Macac… <NA> normalized 1 normalisati… TRUE TRUE
#> 6 Cebus ca… <NA> <NA> unresolved NA <NA> TRUE FALSE
#> 7 <NA> Papio… <NA> unresolved NA <NA> FALSE TRUE
#> # ℹ 1 more variable: notes <chr>
What the columns mean:
- name_x: the species name as it appeared in your data (the argument x to reconcile_tree()).
- name_y: the matching tip label on your tree (the argument tree to reconcile_tree()), or NA if no match was found.
- name_resolved: the canonical name used when synonym resolution applied (the recognised form per the chosen taxonomic authority). NA for matches that didn’t go through the synonym stage.
- match_type: which stage of the cascade matched the name (see Understanding match types below).
- match_score: confidence on [0, 1] (1 for exact / normalised / synonym / manual; < 1 for fuzzy / flagged).
- in_x, in_y: logical; was this name in the data, in the tree, or both?
- notes: human-readable note (e.g. “normalised: lowercased”, “via synonym lookup against COL”, “fuzzy match score 0.92”).
For a detailed report, call reconcile_summary(result).
Suppose you know that Cebus capucinus should not be in
the analysis. You can document this decision:
result <- reconcile_override(
result,
name_x = "Cebus capucinus",
name_y = NA,
action = "reject",
note = "Not in target phylogeny; exclude from analysis"
)
#> ✔ Override applied: 'Cebus capucinus' -> 'NA' (reject)
reconcile_override() updates the existing result (the reconciliation you built earlier) in place, so there is no need to re-run reconcile_tree(). The three actions you can pass to action = ... are:
- "accept": confirm a specific name_x → name_y mapping.
- "reject": mark a name as deliberately excluded.
- "replace": redirect name_x to a different name_y than the cascade produced.
Once satisfied with the reconciliation, apply it:
aligned <- reconcile_apply(
result,
data = trait_data,
tree = tree,
species_col = "species",
drop_unresolved = TRUE
)
#> ! Dropped 1 rows with unresolved species from data
#> ℹ Tree has 5 tips after alignment
# Aligned data frame — only species present in both data and tree
aligned$data
#> species body_mass brain_mass
#> 1 Homo sapiens 70 1.35
#> 2 Pan_troglodytes 50 0.39
#> 3 Gorilla gorilla 160 0.50
#> 4 Pongo pygmaeus 80 0.37
#> 5 Macaca mulatta 8 0.11
# Aligned tree — pruned to matched species
ape::Ntip(aligned$tree)
#> [1] 5
plot(aligned$tree) # the pruned tree
The $data and $tree components now have matching species, ready for comparative analysis.
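Many comparative-method functions expect the rows of the data to follow the order of the tree's tip labels. The base-R sketch below shows one way to check that two name sets agree and to reorder rows to tip order. It uses plain stand-ins for `aligned$data` and `aligned$tree$tip.label` (with the tip order shuffled on purpose), and the `to_key()` helper is hypothetical. Whether `reconcile_apply()` already orders rows this way is not stated here, so treat this as a defensive check, not a required step.

```r
# Stand-in for aligned$data (values from the example above)
aligned_data <- data.frame(
  species   = c("Homo sapiens", "Pan_troglodytes", "Gorilla gorilla",
                "Pongo pygmaeus", "Macaca mulatta"),
  body_mass = c(70, 50, 160, 80, 8)
)

# Stand-in for aligned$tree$tip.label; order shuffled for illustration
tip_labels <- c("Macaca_mulatta", "Pongo_pygmaeus", "Gorilla_gorilla",
                "Homo_sapiens", "Pan_troglodytes")

# Hypothetical helper: compare names on a common key (underscores -> spaces)
to_key <- function(x) gsub("_", " ", x, fixed = TRUE)

# 1. The two species sets must be identical once formatting is ignored
stopifnot(setequal(to_key(aligned_data$species), to_key(tip_labels)))

# 2. Reorder the data rows so row i corresponds to tip i
row_order   <- match(to_key(tip_labels), to_key(aligned_data$species))
data_by_tip <- aligned_data[row_order, ]
rownames(data_by_tip) <- tip_labels

data_by_tip
```

Setting the row names to the tip labels is a common convention for functions that look rows up by tip name; check the documentation of the analysis function you plan to use.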
prepR4pcm can also reconcile species names between
two datasets, not just between a dataset and a tree. The same
matching cascade applies. This is useful when merging trait data from
different sources, where species names often disagree across datasets.
Here is a toy example:
# df1: body mass for three primates (df1 uses an underscore for chimp)
df1 <- data.frame(
species = c("Homo sapiens", "Pan_troglodytes", "Gorilla gorilla"),
mass = c(70, 50, 160)
)
# df2: lifespan for three primates (df2 uses a space for chimp; orang
# is here but not gorilla)
df2 <- data.frame(
species = c("Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"),
lifespan = c(79, 40, 45)
)
# Reconcile the species columns of df1 and df2 against each other.
# `authority = NULL` skips the synonym-lookup stage (no taxonomic
# database needed for this small example). `quiet = TRUE` suppresses
# progress messages.
result2 <- reconcile_data(
x = df1,
y = df2,
authority = NULL,
quiet = TRUE
)
#> ℹ Auto-detected species column: species
#> ℹ Auto-detected species column: species
# The output shows how many names matched, and via which stage.
print(result2)
#>
#> ── Reconciliation: data vs data ────────────────────────────────────────────────
#> Source x: df1
#> Source y: df2
#> Authority: none
#> Timestamp: 2026-06-20 12:44:56
#> ℹ Match coverage: [████████████████████░░░░░░░░░░] 67% (2/3)
#>
#> ── Match summary ──
#>
#> • Exact: 1 (33.3%)
#> • Normalized: 1 (33.3%)
#> • Synonym: 0 ( 0.0%)
#> • Fuzzy: 0 ( 0.0%)
#> • Manual: 0 ( 0.0%)
#> ! Unresolved (x only):1 (33.3%)
#> ! Unresolved (y only):1
#> ! Flagged for review: 0
#> ℹ Use `reconcile_summary()` for details, `reconcile_mapping()` for the full table.
Pan_troglodytes (underscore) in df1 is matched to Pan troglodytes (space) in df2 via normalisation. Gorilla gorilla is in df1 only and Pongo pygmaeus is in df2 only, so both end up as unresolved rows (in_x = TRUE, in_y = FALSE and vice versa).
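The `in_x` / `in_y` flags make it straightforward to separate the unresolved names by side. The sketch below works on a hand-built stand-in for the mapping table of the df1/df2 example (only a subset of the columns, and the row order is an assumption); with a real result you would start from `reconcile_mapping(result2)` instead.

```r
# Hand-built stand-in for the mapping table; not produced by prepR4pcm here
mapping <- data.frame(
  name_x     = c("Homo sapiens", "Pan_troglodytes", "Gorilla gorilla", NA),
  name_y     = c("Homo sapiens", "Pan troglodytes", NA, "Pongo pygmaeus"),
  match_type = c("exact", "normalized", "unresolved", "unresolved"),
  in_x       = c(TRUE, TRUE, TRUE, FALSE),
  in_y       = c(TRUE, TRUE, FALSE, TRUE)
)

# Names present on one side only
only_in_x <- mapping$name_x[mapping$in_x & !mapping$in_y]
only_in_y <- mapping$name_y[mapping$in_y & !mapping$in_x]

# Pairs present on both sides, with the stage that matched them
matched <- mapping[mapping$in_x & mapping$in_y,
                   c("name_x", "name_y", "match_type")]

only_in_x
only_in_y
matched
```

Listing the two one-sided sets separately helps when deciding whether a name is simply absent from the other source or is a candidate for a manual override.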
Every row in the reconcile_mapping() output has a
match_type column. Here is what each value means and what
action (if any) it requires:
| match_type | Meaning | Action needed? |
|---|---|---|
| exact | Verbatim string equality | None |
| normalized | Names matched after stripping underscores, authority strings, and case differences | None; check the notes column if you want to confirm |
| synonym | Names resolved through a taxonomic authority (e.g., Catalogue of Life) to the same accepted name | Verify the resolved name looks correct |
| fuzzy | High-confidence character-level match (score ≥ flag_threshold, default 0.95) | Check the match_score column; review with reconcile_suggest() |
| flagged | Lower-confidence match that needs human review: fuzzy score below flag_threshold, or an indirect synonym chain | Review with reconcile_review() or reconcile_suggest() |
| manual | Set by reconcile_override() or the overrides argument | None; you decided this |
| unresolved | No match found after all stages | Investigate; use reconcile_suggest() for candidates or reconcile_override() to document a decision |
Use
reconcile_summary(result, detail = "mismatches_only") to
see only the rows that need attention.
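To give a feel for how a score relates to `flag_threshold`, here is a rough base-R sketch. The fuzzy metric prepR4pcm uses is not described in this section, so the sketch assumes a normalised Levenshtein similarity computed with `utils::adist()`; the `similarity()` and `classify()` helpers are hypothetical, and only the 0.95 default comes from the table above.

```r
# Hypothetical scoring for illustration; prepR4pcm's actual fuzzy metric may differ.
similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]      # Levenshtein edit distance
  1 - d / max(nchar(a), nchar(b))    # scale to [0, 1]; 1 means identical strings
}

# Label a fuzzy candidate relative to the threshold (default from the table above)
classify <- function(score, flag_threshold = 0.95) {
  if (score >= flag_threshold) "fuzzy" else "flagged"
}

score <- similarity("Turdus merulaa", "Turdus merula")
score            # 1 - 1/14, roughly 0.93
classify(score)  # "flagged"
```

Under this assumed metric, a single-character typo in a 14-character name scores below 0.95, so it would be routed to human review, not accepted silently.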
Researchers often maintain a curated list of known corrections. You can pass these as a data frame, or as a path to a file in CSV format:
The chunks below use my_data and my_tree as hypothetical objects (substitute your own data frame and phylo object). They are marked eval = FALSE so the vignette renders without requiring those objects to exist.
# A data frame of known corrections
corrections <- data.frame(
name_x = c("Corvus sp.", "Turdus merulaa"),
name_y = c("Corvus corax", "Turdus merula"),
user_note = c("Only one Corvus in our tree", "Typo in source data")
)
result4 <- reconcile_tree(
x = my_data,
tree = my_tree,
overrides = corrections
)
# Or from a CSV file:
result5 <- reconcile_tree(
x = my_data,
tree = my_tree,
overrides = "lab_corrections.csv"
)
Overrides are applied before any other matching stage, so they always take priority.
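If you keep corrections in a shared file, a quick round trip through CSV can catch formatting problems before the file is passed to `overrides`. The sketch below uses only base R and writes to a temporary directory. It assumes the CSV carries the same column names as the data-frame form (`name_x`, `name_y`, `user_note`), which is an assumption, not something this section confirms.

```r
# Corrections table with the columns used in the data-frame example above
corrections <- data.frame(
  name_x    = c("Corvus sp.", "Turdus merulaa"),
  name_y    = c("Corvus corax", "Turdus merula"),
  user_note = c("Only one Corvus in our tree", "Typo in source data")
)

# Write to a temporary location (substitute your own path)
csv_path <- file.path(tempdir(), "lab_corrections.csv")
write.csv(corrections, csv_path, row.names = FALSE)

# Read it back and run basic sanity checks before using it as overrides
read_back <- read.csv(csv_path, stringsAsFactors = FALSE)
stopifnot(identical(names(read_back), c("name_x", "name_y", "user_note")))
stopifnot(!anyDuplicated(read_back$name_x))  # one correction per source name

read_back
```

`row.names = FALSE` keeps an extra index column out of the file, and the duplicate check guards against two conflicting corrections for the same source name.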
reconcile_multi() reconciles several datasets at once,
pooling all unique species names before running the cascade:
# Suppose you have several data frames to reconcile against one tree.
# `my_ecology_data`, `my_morpho_data`, and `my_tree` are **hypothetical**
# user-supplied objects; substitute your own.
datasets <- list(
traits = trait_data, # defined above
ecology = my_ecology_data, # your own data frame
morpho = my_morpho_data # your own data frame
)
result6 <- reconcile_multi(datasets, my_tree)
print(result6)
A typical workflow starts from a data.frame of trait values (one row per species) and a phylogenetic tree as an ape::phylo object.
The chunk below uses hypothetical files (species_traits.csv, species_tree.nwk); substitute your own paths. The chunk is marked eval = FALSE so it doesn’t try to read files that don’t exist when the vignette is rendered.
library(prepR4pcm)
# 1. Load your data and tree (hypothetical paths -- substitute your own)
my_data <- read.csv("species_traits.csv")
my_tree <- ape::read.tree("species_tree.nwk")
# 2. Reconcile
result <- reconcile_tree(my_data, my_tree, authority = "col")
# 3. Review
print(result)
reconcile_summary(result, detail = "mismatches_only")
# 4. Fix manually if needed
result <- reconcile_override(result, "Corvus sp.", "Corvus corax",
note = "Only one Corvus in tree")
# 5. Apply
aligned <- reconcile_apply(result, data = my_data, tree = my_tree,
drop_unresolved = TRUE)
# 6. Analyse
# aligned$data and aligned$tree are ready for caper, phytools, MCMCglmm, etc.