From species names to a phylogenetic posterior

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

From species names to a phylogenetic posterior — prepR4pcm + pigauto

Phylogenetic comparative methods (PCMs) propagate uncertainty imperfectly. The standard pipeline takes one tree, fits one model, and reports one set of confidence intervals. But the tree itself is typically a point estimate from a posterior, the trait data typically have missing values, and each of those is its own source of uncertainty.

The two packages compose: prepR4pcm produces multiPhylo, pigauto consumes multiPhylo. Once the contract is set up, swapping the backend (rtrees vs datelife vs fishtree) doesn’t change the rest of the pipeline.

What we’ll build

A model-ready dataset for fish, with:

A posterior of 50 fish chronograms from the Fish Tree of Life (Rabosky et al. 2018).
5 imputations of the missing trait values per tree (so 5 × 50 = 250 model fits — adjust m_per_tree upward for a real analysis).
Pooled estimates with valid standard errors that account for both axes of uncertainty.

library(prepR4pcm)
# Suggests packages we'll lean on:
# install.packages("fishtree")
# pak::pak("phylotastic/datelife")
# Optional, for the pooling step at the end:
# pak::pak("itchyshin/pigauto")

Step 1 — Retrieve a posterior of trees

Start with whatever names appear in your dataset. They never quite match a tree’s tip labels — formatting drifts, synonyms creep in. pr_get_tree() copes: it normalises every name (underscores, whitespace, authority strings) and, for the fishtree backend, resolves synonyms through Open Tree of Life’s Taxonomic Name Resolution Service before querying the backend.

# Your trait data — typically loaded from disk. Names need not be
# tidy: `Esox_lucius` carries an underscore, which pr_get_tree()
# normalises away.
my_data <- data.frame(
  species   = c("Salmo salar", "Esox_lucius", "Oncorhynchus mykiss"),
  body_size = c(98, NA, 50)        # NA is fine; pigauto will impute it
)

Every backend that can return multiple trees does so via n_tree > 1. Here we ask fishtree for 50 stochastic resolutions (Chang, Rabosky, Alfaro 2019, Syst. Biol.).

trees <- pr_get_tree(my_data,
                     species_col = "species",
                     source      = "fishtree",
                     n_tree      = 50)

class(trees$tree)      # "multiPhylo"
length(trees$tree)     # 50
trees$matched          # the 3 species, in their original input form
trees$unmatched        # character(0) — every species was placed

# Per-tree provenance: one list entry per returned tree, each
# recording source_index, citation, calibration_method, and n_tips.
length(trees$backend_meta$tree_provenance)   # 50
trees$backend_meta$tree_provenance[[1]]      # the provenance of tree 1

pr_get_tree() records what happened to every name in trees$mapping (one row per input species), so a name that quietly fell out of the tree cannot go unnoticed.

For the universal-but-slower option, use source = "datelife" — the chronogram-database backend that returns one calibrated tree per source (Hedges, Bininda-Emonds, Upham, etc.).

trees <- pr_get_tree(my_data,
                     species_col = "species",
                     source      = "datelife",
                     n_tree      = 50)
# trees$tree is multiPhylo; per-source citations are in
# trees$backend_meta$tree_provenance[[i]]$citation

If you already have a topology and want to date it (rather than fetch a new one), use pr_date_tree():

my_topology <- ape::read.tree("my_topology.nex")
dated_trees <- pr_date_tree(my_topology, n_dated = 50)
class(dated_trees$tree)        # "multiPhylo"
length(dated_trees$tree)       # up to 50

What n_dated = 50 actually returns: 50 chronograms that all share my_topology but have different branch lengths. Each variant comes from a different source paper in DateLife’s database (e.g. variant 1 dated against Hedges et al. 2015, variant 2 against Bininda-Emonds et al. 2007, etc.). The topology is fixed; the dating varies.

To get 50 different topologies, fetch them separately first (e.g. pr_get_tree(species, source = "rtrees", taxon = "mammal") returns 100 mammal topologies sampled from VertLife / Upham et al. 2019), then pass that multiPhylo to pr_date_tree() to date each one. That gives you the topology × source cross-product.

Step 2 — Reconcile the data to the posterior

pr_get_tree() resolves names well enough to retrieve a tree, but the modelling step needs a data frame whose rows line up one-to-one with the tree’s tips. reconcile_tree() matches your data against the retrieved topology, and reconcile_apply() returns the aligned data–tree pair. The 50 posterior trees share one tip set, so reconciling against the first tree aligns the data to all of them.

rec <- reconcile_tree(my_data, trees$tree[[1]],
                      x_species = "species",
                      authority = NULL)   # tree tips are plain binomials
rec        # match summary: 1 exact, 2 normalized, 0 unresolved

aligned <- reconcile_apply(rec,
                           data            = my_data,
                           tree            = trees$tree[[1]],
                           species_col     = "species",
                           drop_unresolved = TRUE)

aligned$data    # one row per tree tip — the model-ready frame
#>               species body_size
#> 1         Salmo salar        98
#> 2         Esox_lucius        NA
#> 3 Oncorhynchus mykiss        50

aligned$data carries one row per species in the tree, with Esox_lucius keeping the NA that pigauto will impute; aligned$tree is the matching pruned tree.

Step 3 — Format citations (do this before you submit)

Reviewers ask. Save yourself the round-trip.

# Plain text for an email or a methods paragraph
pr_cite_tree(trees)

# Markdown for a GitHub issue or PR description
cat(pr_cite_tree(trees, format = "markdown"))

# BibTeX for a manuscript bibliography (sanity-check before submitting)
cat(pr_cite_tree(trees, format = "bibtex"))

Step 4 — Hand the posterior to pigauto

This is the contract. pigauto::multi_impute_trees(df, trees, m_per_tree) takes df (data with NAs) and trees (multiPhylo) and returns m_per_tree complete-data imputations per tree. The output knows which imputation came from which tree, and pigauto’s pooling step later uses that index.

library(pigauto)

# 5 imputations per tree × 50 trees = 250 complete datasets
mi <- multi_impute_trees(aligned$data,
                         trees      = trees$tree,
                         m_per_tree = 5)

# Fit your model to each completed dataset. Replace the formula
# with whatever your hypothesis is.
fits <- with_imputations(mi, function(df) {
  glmmTMB::glmmTMB(body_size ~ 1 + (1 | species),
                   data = df)
})

# Pool via Rubin's rules — the standard errors reflect BOTH the
# imputation uncertainty AND the tree-posterior uncertainty.
pooled <- pool_mi(fits,
                  coef_fun = function(fit) fixef(fit)$cond,
                  vcov_fun = function(fit) vcov(fit)$cond)

pooled

The reported intervals are wider than the equivalent single-tree-single-imputation pipeline — and correctly so. The narrower intervals were always overconfident.

Choosing a backend

Verified on a clean macOS R 4.4 install on 2026-05-01. See the Comparing tree backends vignette for the full status table.

If your taxon is…	And you want…	Use	Status
Universal	A single best-guess tree	`pr_get_tree(source = "rotl")`	✅ verified — returns the Open Tree of Life synthesis tree
Universal	A posterior of dated trees	`pr_get_tree(source = "datelife", n_tree = 50)`	likely ✅ — `datelife` is in `Enhances`; install separately with `pak::pak("phylotastic/datelife")`
Birds	A single tree, current Clements	`pr_get_tree(source = "clootl", n_tree = 1)`	✅ verified — works out of the box (uses the v1.6/2025 taxonomy bundled with `clootl`)
Birds	A posterior of trees, current Clements	`pr_get_tree(source = "clootl", n_tree = 50)`	⚠️ requires `clootl::get_avesdata_repo(".")` once before this works; without it `sampleTrees()` errors with `AvesData repo not found`. Capped at 100 upstream.
Fish	Time-calibrated, posterior	`pr_get_tree(source = "fishtree", n_tree = 50)`	✅ verified — returns `multiPhylo[50]`
Bird, mammal, fish, amphibian, reptile, plant, shark, bee, butterfly	Taxon-specific mega-tree, possibly grafted	`pr_get_tree(source = "rtrees", taxon = "<group>")`	✅ works; ⚠️ `n_tree` is informational only — `rtrees::get_tree()` has no `n_tree` argument, so the count is fixed by the chosen mega-tree (`taxon = "bird"` → 100, `taxon = "mammal"` → 100, `taxon = "fish"` → 1, etc.).

For the dating-only case (you already have a topology):

If your topology is…	Use	Status
Universal taxa	`pr_date_tree(my_tree, n_dated = 50)`	likely ✅ — untested in this run; same dependency story as `datelife`

What’s not in this pipeline

Tree comparison across backends. prepR4pcm provides pr_tree_compare() for quick pairwise Jaccard / Robinson-Foulds / branch-length comparisons (see the comparing tree backends vignette). For richer visual comparison use phytools::cophyloplot() or phangorn::densiTree().
Trait imputation alone. That’s pigauto’s impute() / multi_impute(). This vignette skips straight to the trees-aware variant because that’s the integration point.

For repeated runs of the same retrieval (e.g. while iterating on a manuscript), pass cache = TRUE to pr_get_tree(). The cache lives under tools::R_user_dir() by default; redirect with pr_tree_cache_dir() if you want it inside the project.