The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Comparative analyses often begin well before model fitting. Trait data are commonly assembled from multiple databases, each with its own species names, column names, measurement conventions, and taxonomic coverage. A phylogenetic tree adds one more requirement: the final trait table must contain species that can be matched to the tree, and the table and tree must be aligned before they can be used in a phylogenetic model.
This vignette focuses on that database-assembly step. We combine three mammal trait datasets, reconcile their species names with a phylogenetic tree, and collapse the matched records into a species-level table. The goal is not to fit a model here, but to show how several databases and a tree can be brought into a single, clean object ready for downstream comparative analyses.
The worked example draws on five packages: dplyr for the table manipulations, readr for tolerant number parsing, stringr for tidying raw species strings, ape for the phylogeny, and prepR4pcm for reconciling species names against the tree.
The package bundles three small source tables and one small tree. The source tables were sampled from real database structures, using only the columns needed for this example. They mirror three common inputs for mammal comparative work: the Amniote life-history database (Myhrvold et al. 2015), PanTHERIA (Jones et al. 2009), and TetrapodTraits (Moura et al. 2024). The bundled tree is a subset of the VertLife mammal phylogeny (Upham et al. 2019); the next code chunk reports its exact tip count. For analysis-grade trees download the full credible set from https://vertlife.org/phylosubsets/.
data(mammal_amniote_example)
data(mammal_pantheria_example)
data(mammal_tetrapodtraits_example)
data(mammal_tree_example)
cat(sprintf("Amniote-like source: %d rows\n", nrow(mammal_amniote_example)))
#> Amniote-like source: 4953 rows
cat(sprintf("PanTHERIA-like source: %d rows\n", nrow(mammal_pantheria_example)))
#> PanTHERIA-like source: 5416 rows
cat(sprintf("TetrapodTraits-like source: %d rows\n", nrow(mammal_tetrapodtraits_example)))
#> TetrapodTraits-like source: 5911 rows
cat(sprintf("Tree: %d tips\n", ape::Ntip(mammal_tree_example)))
#> Tree: 5987 tipsThe three sources contain related information, but they do not use the same column names. This is typical when assembling trait data from independent databases.
source_columns <- tibble::tibble(
source = c("Amniote", "PanTHERIA", "TetrapodTraits"),
n_rows = c(
nrow(mammal_amniote_example),
nrow(mammal_pantheria_example),
nrow(mammal_tetrapodtraits_example)
),
n_columns = c(
ncol(mammal_amniote_example),
ncol(mammal_pantheria_example),
ncol(mammal_tetrapodtraits_example)
),
species_column = c("name", "MSW05_Binomial", "Scientific.Name")
)
knitr::kable(source_columns)| source | n_rows | n_columns | species_column |
|---|---|---|---|
| Amniote | 4953 | 5 | name |
| PanTHERIA | 5416 | 4 | MSW05_Binomial |
| TetrapodTraits | 5911 | 3 | Scientific.Name |
We first convert each source into the same long-format structure. The
result keeps one row per source record, plus a source
column so the provenance of each value is retained.
amniote_std <- prep_source(
mammal_amniote_example,
source_name = "AMNIOTE",
species_col = "name",
female_mass_col = "female_body_mass_g",
adult_mass_col = "adult_body_mass_g",
litter_size_col = "litter_or_clutch_size_n",
litter_y_col = "litters_or_clutches_per_y"
)
pantheria_std <- prep_source(
mammal_pantheria_example,
source_name = "PANTHERIA",
species_col = "MSW05_Binomial",
adult_mass_col = "5-1_AdultBodyMass_g",
litter_size_col = "15-1_LitterSize",
litter_y_col = "16-1_LittersPerYear"
)
tetrapodtraits_std <- prep_source(
mammal_tetrapodtraits_example,
source_name = "TETRAPODTRAITS",
species_col = "Scientific.Name",
adult_mass_col = "BodyMass_g",
litter_size_col = "LitterSize"
)
db_long_raw <- bind_rows(
amniote_std,
pantheria_std,
tetrapodtraits_std
)
knitr::kable(slice_head(db_long_raw, n = 10))| source | row_in_source | species | female_mass_g | adult_mass_g | litter_size_n | litters_per_year_n |
|---|---|---|---|---|---|---|
| AMNIOTE | 1 | Amblysomus_corriae | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 2 | Amblysomus_hottentotus | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 3 | Amblysomus_marleyi | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 4 | Amblysomus_robustus | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 5 | Amblysomus_septentrionalis | NA | 64.80 | 1.93 | 2 |
| AMNIOTE | 6 | Calcochloris_obtusirostris | NA | 24.05 | 2.00 | NA |
| AMNIOTE | 7 | Carpitalpa_arendsi | NA | 52.34 | NA | NA |
| AMNIOTE | 8 | Chlorotalpa_duthieae | NA | 31.34 | 2.00 | NA |
| AMNIOTE | 9 | Chlorotalpa_sclateri | NA | 38.30 | 2.00 | NA |
| AMNIOTE | 10 | Chrysochloris_asiatica | NA | 36.93 | 3.12 | NA |
We can now check how much information each source contributes.
source_coverage <- db_long_raw |>
group_by(source) |>
summarise(
n_records = n(),
n_species = n_distinct(species),
adult_mass_records = sum(!is.na(adult_mass_g)),
female_mass_records = sum(!is.na(female_mass_g)),
litter_size_records = sum(!is.na(litter_size_n)),
litter_y_records = sum(!is.na(litters_per_year_n)),
.groups = "drop"
)
knitr::kable(source_coverage)| source | n_records | n_species | adult_mass_records | female_mass_records | litter_size_records | litter_y_records |
|---|---|---|---|---|---|---|
| AMNIOTE | 4953 | 4953 | 4651 | 438 | 3511 | 2146 |
| PANTHERIA | 5416 | 5416 | 3542 | 0 | 2501 | 894 |
| TETRAPODTRAITS | 5911 | 5911 | 5911 | 0 | 3547 | 0 |
In this table n_records is the number of rows the source
contributes and n_species the number of distinct species;
each *_records column counts the rows where that trait has
a non-missing, positive value. Those counts sit well below
n_records — no single database measures every trait for
every species, which is exactly why combining sources is worthwhile.
Name reconciliation is done on the unique source names, not on every row of the trait database. This creates one matching decision per raw species name.
species_lookup <- db_long_raw |>
distinct(species) |>
rename(species_raw = species)
knitr::kable(slice_head(species_lookup, n = 10))| species_raw |
|---|
| Amblysomus_corriae |
| Amblysomus_hottentotus |
| Amblysomus_marleyi |
| Amblysomus_robustus |
| Amblysomus_septentrionalis |
| Calcochloris_obtusirostris |
| Carpitalpa_arendsi |
| Chlorotalpa_duthieae |
| Chlorotalpa_sclateri |
| Chrysochloris_asiatica |
We now reconcile the source names against the tree. External synonym lookup is turned off here so the vignette remains fast and reproducible.
rec0 <- reconcile_tree(
x = species_lookup,
tree = mammal_tree_example,
x_species = "species_raw",
authority = NULL,
fuzzy = FALSE,
quiet = TRUE
)
reconcile_summary(rec0, detail = "brief")
#>
#> === Reconciliation Report ===
#> Type: data_tree
#> Timestamp: 2026-06-20 12:44:49
#> Package: prepR4pcm 1.0.0
#> Authority: NONE (version: latest)
#> Rank: species
#>
#> --- Match Summary ---
#> Exact: 4677 / 11304
#> Normalized: 1234 / 11304
#> Synonym: 0 / 11304
#> Fuzzy: 0 / 11304
#> Manual: 0 / 11304
#> Unresolved: 5393 (x only) + 76 (y only)
#> The mapping table records which source names matched the tree and
which need review. In it, name_x is the raw source name,
name_y the tree tip it resolved to, match_type
records how the two were linked, and in_x /
in_y flag whether the name is present in the data and in
the tree.
mapping0 <- reconcile_mapping(rec0)
mapping_preview <- mapping0 |>
select(any_of(c("name_x", "name_y", "match_type", "in_x", "in_y"))) |>
arrange(match_type, name_x) |>
slice_head(n = 15)
knitr::kable(mapping_preview)| name_x | name_y | match_type | in_x | in_y |
|---|---|---|---|---|
| Abditomys_latidens | Abditomys_latidens | exact | TRUE | TRUE |
| Abeomelomys_sevia | Abeomelomys_sevia | exact | TRUE | TRUE |
| Abrawayaomys_ruschii | Abrawayaomys_ruschii | exact | TRUE | TRUE |
| Abrocoma_bennettii | Abrocoma_bennettii | exact | TRUE | TRUE |
| Abrocoma_boliviensis | Abrocoma_boliviensis | exact | TRUE | TRUE |
| Abrocoma_budini | Abrocoma_budini | exact | TRUE | TRUE |
| Abrocoma_cinerea | Abrocoma_cinerea | exact | TRUE | TRUE |
| Abrocoma_famatina | Abrocoma_famatina | exact | TRUE | TRUE |
| Abrocoma_shistacea | Abrocoma_shistacea | exact | TRUE | TRUE |
| Abrocoma_vaccarum | Abrocoma_vaccarum | exact | TRUE | TRUE |
| Abrothrix_andinus | Abrothrix_andinus | exact | TRUE | TRUE |
| Abrothrix_hershkovitzi | Abrothrix_hershkovitzi | exact | TRUE | TRUE |
| Abrothrix_illuteus | Abrothrix_illuteus | exact | TRUE | TRUE |
| Abrothrix_jelskii | Abrothrix_jelskii | exact | TRUE | TRUE |
| Abrothrix_lanosus | Abrothrix_lanosus | exact | TRUE | TRUE |
Names that remain unresolved or flagged can be inspected separately. We also ask for suggested matches. These suggestions are not applied automatically; they are candidates for manual review.
review_names <- mapping0 |>
filter(in_x, match_type %in% c("unresolved", "flagged")) |>
arrange(match_type, name_x)
if (nrow(review_names) == 0) {
cat("No unresolved or flagged names in this example.\n")
} else {
cat(sprintf(
"Showing 10 of %d unresolved or flagged names.\n\n",
nrow(review_names)
))
knitr::kable(slice_head(review_names, n = 10) |>
select(any_of(c("name_x", "name_y", "match_type",
"in_x", "in_y"))))
}
#> Showing 10 of 5393 unresolved or flagged names.| name_x | name_y | match_type | in_x | in_y |
|---|---|---|---|---|
| Abditomys latidens | NA | unresolved | TRUE | FALSE |
| Abeomelomys sevia | NA | unresolved | TRUE | FALSE |
| Abrawayaomys ruschii | NA | unresolved | TRUE | FALSE |
| Abrocoma bennettii | NA | unresolved | TRUE | FALSE |
| Abrocoma boliviensis | NA | unresolved | TRUE | FALSE |
| Abrocoma budini | NA | unresolved | TRUE | FALSE |
| Abrocoma cinerea | NA | unresolved | TRUE | FALSE |
| Abrocoma famatina | NA | unresolved | TRUE | FALSE |
| Abrocoma shistacea | NA | unresolved | TRUE | FALSE |
| Abrocoma vaccarum | NA | unresolved | TRUE | FALSE |
suggestions0 <- reconcile_suggest(rec0, n = 3, threshold = 0.9)
#> ✔ Found suggestions for 4683 of 5393 unresolved species.
suggestions_to_review <- suggestions0 |>
transmute(
name_x = unresolved,
name_y = suggestion,
score = score
) |>
filter(score >= 0.9, score < 1) |>
arrange(desc(score))
if (nrow(suggestions_to_review) == 0) {
cat("No high-confidence, non-perfect suggestions were found.\n")
} else {
cat(
"Showing up to 10 high-confidence suggested matches with score below 1.\n\n",
sep = ""
)
knitr::kable(slice_head(suggestions_to_review, n = 10), digits = 3)
}
#> Showing up to 10 high-confidence suggested matches with score below 1.| name_x | name_y | score |
|---|---|---|
| Tamiops mcclellandii | Tamiops_macclellandii | 0.969 |
| Tamiops_mcclellandii | Tamiops_macclellandii | 0.969 |
| Rattus arfakiensis | Rattus_arfakienis | 0.964 |
| Rattus_arfakiensis | Rattus_arfakienis | 0.964 |
| Galeopterus variegates | Galeopterus_variegatus | 0.960 |
| Galeopterus_variegates | Galeopterus_variegatus | 0.960 |
| Herpestes edwardsi | Herpestes_edwardsii | 0.956 |
| Herpestes_edwardsi | Herpestes_edwardsii | 0.956 |
| Neophascogale lorentzi | Neophascogale_lorentzii | 0.956 |
| Neophascogale_lorentzi | Neophascogale_lorentzii | 0.956 |
Automated reconciliation is useful, but some names still need human review. The suggestion table above helps identify likely matches. Manual corrections are stored as a small editable table.
The important rule is simple: name_x must be a name from
the trait database, and name_y must be an exact tip label
in the tree.
manual_overrides <- suggestions_to_review |>
slice_head(n = 2) |>
mutate(user_note = "Accepted from high-confidence reconciliation suggestion") |>
select(name_x, name_y, user_note)
if (nrow(manual_overrides) == 0) {
cat("No manual corrections were added in this example.\n")
} else {
knitr::kable(manual_overrides, digits = 3)
}| name_x | name_y | user_note |
|---|---|---|
| Tamiops mcclellandii | Tamiops_macclellandii | Accepted from high-confidence reconciliation suggestion |
| Tamiops_mcclellandii | Tamiops_macclellandii | Accepted from high-confidence reconciliation suggestion |
We then apply any manual corrections to the reconciliation table. If
manual_overrides is empty, the automated mapping is kept
unchanged.
mapping_final <- mapping0 |>
left_join(
manual_overrides |>
rename(manual_name_y = name_y, manual_note = user_note),
by = "name_x"
) |>
mutate(
species_tree = coalesce(manual_name_y, name_y),
matched_to_tree = species_tree %in% mammal_tree_example$tip.label,
match_type = if_else(!is.na(manual_name_y), "manual", match_type),
notes = manual_note
) |>
select(-manual_name_y, -manual_note)final_reconciliation_summary <- mapping_final |>
filter(in_x) |>
count(match_type, name = "n_names") |>
arrange(desc(n_names), match_type)
knitr::kable(final_reconciliation_summary)| match_type | n_names |
|---|---|
| unresolved | 5391 |
| exact | 4677 |
| normalized | 1234 |
| manual | 2 |
The corrected mapping can now be joined back to the full source-level trait table.
name_map <- mapping_final |>
filter(in_x) |>
transmute(
species_raw = name_x,
species_tree = species_tree,
matched_to_tree = matched_to_tree,
match_type = match_type,
notes = notes
)
db_full <- db_long_raw |>
rename(species_raw = species) |>
left_join(name_map, by = "species_raw") |>
relocate(source, row_in_source, species_raw, species_tree,
matched_to_tree, match_type)
db_tree_matched <- db_full |>
filter(matched_to_tree, !is.na(species_tree))
knitr::kable(
db_tree_matched |>
select(source, species_raw, species_tree, match_type,
adult_mass_g, female_mass_g, litter_size_n,
litters_per_year_n) |>
slice_head(n = 10),
digits = 3
)| source | species_raw | species_tree | match_type | adult_mass_g | female_mass_g | litter_size_n | litters_per_year_n |
|---|---|---|---|---|---|---|---|
| AMNIOTE | Amblysomus_corriae | Amblysomus_corriae | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_hottentotus | Amblysomus_hottentotus | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_marleyi | Amblysomus_marleyi | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_robustus | Amblysomus_robustus | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Amblysomus_septentrionalis | Amblysomus_septentrionalis | exact | 64.80 | NA | 1.93 | 2 |
| AMNIOTE | Calcochloris_obtusirostris | Calcochloris_obtusirostris | exact | 24.05 | NA | 2.00 | NA |
| AMNIOTE | Carpitalpa_arendsi | Carpitalpa_arendsi | exact | 52.34 | NA | NA | NA |
| AMNIOTE | Chlorotalpa_duthieae | Chlorotalpa_duthieae | exact | 31.34 | NA | 2.00 | NA |
| AMNIOTE | Chlorotalpa_sclateri | Chlorotalpa_sclateri | exact | 38.30 | NA | 2.00 | NA |
| AMNIOTE | Chrysochloris_asiatica | Chrysochloris_asiatica | exact | 36.93 | NA | 3.12 | NA |
The source-level database can now be summarised to one record per matched species. Here we use the median trait value across available source records and keep simple provenance columns.
db_species_summary <- db_tree_matched |>
group_by(species_tree) |>
summarise(
n_sources_total = n_distinct(source),
sources = safe_sources(source),
adult_mass_g = safe_median(adult_mass_g),
female_mass_g = safe_median(female_mass_g),
litter_size_n = safe_median(litter_size_n),
litters_per_year_n = safe_median(litters_per_year_n),
adult_mass_n_records = sum(!is.na(adult_mass_g)),
female_mass_n_records = sum(!is.na(female_mass_g)),
litter_size_n_records = sum(!is.na(litter_size_n)),
litter_y_n_records = sum(!is.na(litters_per_year_n)),
.groups = "drop"
) |>
mutate(annual_offspring_n = litter_size_n * litters_per_year_n)trait_coverage <- db_species_summary |>
summarise(
n_species = n(),
adult_mass_species = sum(!is.na(adult_mass_g)),
female_mass_species = sum(!is.na(female_mass_g)),
litter_size_species = sum(!is.na(litter_size_n)),
litters_per_year_species= sum(!is.na(litters_per_year_n)),
annual_offspring_species= sum(!is.na(annual_offspring_n))
)
knitr::kable(trait_coverage)| n_species | adult_mass_species | female_mass_species | litter_size_species | litters_per_year_species | annual_offspring_species |
|---|---|---|---|---|---|
| 5911 | 5634 | 395 | 3446 | 2000 | 1975 |
Each *_species column counts the species with a
non-missing value for that trait. Some species still carry
NA: even with three sources combined, not every species has
every trait measured. Downstream comparative models handle these gaps
through imputation or complete-case analysis.
For phylogenetic comparative models, the data and tree must refer to the same species. Here we prune the tree and order the data rows to match the tree tips.
matched_tips <- intersect(
mammal_tree_example$tip.label,
db_species_summary$species_tree
)
tree_pcm <- keep.tip(mammal_tree_example, matched_tips)
pcm_data <- db_species_summary |>
filter(species_tree %in% tree_pcm$tip.label) |>
mutate(species = species_tree) |>
arrange(match(species, tree_pcm$tip.label)) |>
relocate(species)
stopifnot(identical(pcm_data$species, tree_pcm$tip.label))alignment_check <- tibble::tibble(
object = c("pcm_data", "tree_pcm"),
species_or_tips = c(nrow(pcm_data), ape::Ntip(tree_pcm)),
aligned = c(
identical(pcm_data$species, tree_pcm$tip.label),
identical(pcm_data$species, tree_pcm$tip.label)
)
)
knitr::kable(alignment_check)| object | species_or_tips | aligned |
|---|---|---|
| pcm_data | 5911 | TRUE |
| tree_pcm | 5911 | TRUE |
The final objects are pcm_data and
tree_pcm. The table below shows the first 10 rows of the
assembled database. This is the object that would be passed to
downstream phylogenetic comparative models.
knitr::kable(
pcm_data |>
select(species, adult_mass_g, litter_size_n,
litters_per_year_n, annual_offspring_n,
n_sources_total, sources) |>
slice_head(n = 10),
digits = 3
)| species | adult_mass_g | litter_size_n | litters_per_year_n | annual_offspring_n | n_sources_total | sources |
|---|---|---|---|---|---|---|
| Zaglossus_bartoni | 8951.71 | 1 | 1.00 | 1.00 | 1 | AMNIOTE |
| Zaglossus_bruijnii | 16500.00 | 1 | NA | NA | 1 | TETRAPODTRAITS |
| Zaglossus_attenboroughi | 3000.00 | NA | NA | NA | 1 | AMNIOTE |
| Tachyglossus_aculeatus | 3169.50 | 1 | 0.58 | 0.58 | 1 | AMNIOTE |
| Ornithorhynchus_anatinus | 1225.00 | 2 | 1.10 | 2.20 | 1 | AMNIOTE |
| Hydrodamalis_gigas | 8950000.00 | 1 | NA | NA | 2 | AMNIOTE; TETRAPODTRAITS |
| Trichechus_manatus | 387500.00 | 1 | 0.40 | 0.40 | 1 | AMNIOTE |
| Trichechus_senegalensis | 477000.00 | 1 | NA | NA | 1 | AMNIOTE |
| Trichechus_inunguis | 294000.59 | 1 | 0.50 | 0.50 | 1 | AMNIOTE |
| Dugong_dugon | 360000.00 | 1 | 0.22 | 0.22 | 1 | AMNIOTE |
The helper functions used in this vignette
(prep_source(), pull_number_or_na(),
safe_sources(), safe_median()) are local to
this document; they are deliberately minimal so you can copy and adapt
them for your own assembly script. If a generic version proves useful in
practice we can graduate them to exported package functions.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.