One of the major goals of rotl is to help users combine data from other sources with the phylogenetic trees in the Open Tree database. This example demonstrates how a user might connect data they have collected to trees from Open Tree.
Let’s say you have a dataset where each row represents a measurement taken from one species, and your goal is to put these measurements in some phylogenetic context. Here’s a small example: the best estimate of the mutation rate for a set of unicellular eukaryotes, along with some other properties of those species that might explain the mutation rate:
csv_path <- system.file("extdata", "protist_mutation_rates.csv", package = "rotl")
mu <- read.csv(csv_path, stringsAsFactors=FALSE)
mu
## species mu pop.size genome.size
## 1 Tetrahymena thermophila 7.61e-12 1.12e+08 1.04e+08
## 2 Paramecium tetraurelia 1.94e-11 1.24e+08 7.20e+07
## 3 Chlamydomonas reinhardtii 2.08e-10 1.00e+08 1.12e+08
## 4 Dictyostelium discoideum 2.90e-11 7.40e+06 3.40e+07
## 5 Saccharomyces cerevisiae 3.30e-10 1.00e+08 1.25e+08
## 6 Saccharomyces pombe 2.00e-10 1.00e+07 1.25e+08
If we want to get a tree for these species, we need to start by finding the unique ID for each of them in the Open Tree database. We can use the Taxonomic Name Resolution Service (tnrs) functions to do this. Before we do that, we should see if any of the taxonomic contexts, which can be used to narrow a search and avoid conflicts between different nomenclatural codes, apply to our group of species:
library(rotl)
tnrs_contexts()
## Possible contexts:
## Animals
## Birds, Tetrapods, Mammals, Amphibians, Vertebrates, Arthropods
## Arthropods, Molluscs, Nematodes, Platyhelminthes, Annelids, Cnidarians
## Cnidarians, Arachnides, Insects
## Bacteria
## SAR group, Archaea, Excavata, Amoebae, Centrohelida, Haptophyta
## Haptophyta, Apusozoa, Diatoms, Ciliates, Forams
## Fungi
## Basidiomycetes, Ascomycetes
## Land plants
## Hornworts, Mosses, Liverworts, Vascular plants, Club mosses, Ferns
## Ferns, Seed plants, Flowering plants, Monocots, Eudicots, Rosids
## Rosids, Asterids, Asterales, Asteraceae, Aster, Symphyotrichum
## Symphyotrichum, Campanulaceae, Lobelia
## All life
Hmm, none of those groups contains all of our species. In this case we can search using the All life context and the function tnrs_match_names:
taxon_search <- tnrs_match_names(mu$species, context_name="All life")
knitr::kable(taxon_search)
search_string | unique_name | approximate_match | ott_id | is_synonym | is_deprecated | number_matches |
---|---|---|---|---|---|---|
tetrahymena thermophila | Tetrahymena thermophila | FALSE | 180195 | FALSE | FALSE | 1 |
paramecium tetraurelia | Paramecium tetraurelia | FALSE | 568130 | FALSE | FALSE | 1 |
chlamydomonas reinhardtii | Chlamydomonas reinhardtii | FALSE | 33153 | FALSE | FALSE | 1 |
dictyostelium discoideum | Dictyostelium discoideum | FALSE | 160850 | FALSE | FALSE | 1 |
saccharomyces cerevisiae | Saccharomyces cerevisiae | FALSE | 908549 | FALSE | FALSE | 1 |
saccharomyces pombe | Schizosaccharomyces pombe | FALSE | 990004 | TRUE | FALSE | 1 |
Good, all of our species are known to Open Tree. Note, though, that one of the names is a synonym: Saccharomyces pombe is an older name for what is now called Schizosaccharomyces pombe. As the name suggests, the Taxonomic Name Resolution Service is designed to deal with these problems (and similar ones like misspellings), but it is always a good idea to check the results of tnrs_match_names closely to ensure they are what you expect.
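One way to make that check systematic is to flag every row that was not an exact, unambiguous match. The sketch below uses a mock data.frame (the hypothetical name matches_demo) that copies a few columns from the table above, so it runs without a network connection; with a real result you would apply the same filter to taxon_search.

```r
# Mock of the relevant columns of the table above (values copied from it).
# `matches_demo` is a hypothetical name used only for this illustration.
matches_demo <- data.frame(
  search_string = c("tetrahymena thermophila", "paramecium tetraurelia",
                    "chlamydomonas reinhardtii", "dictyostelium discoideum",
                    "saccharomyces cerevisiae", "saccharomyces pombe"),
  unique_name = c("Tetrahymena thermophila", "Paramecium tetraurelia",
                  "Chlamydomonas reinhardtii", "Dictyostelium discoideum",
                  "Saccharomyces cerevisiae", "Schizosaccharomyces pombe"),
  approximate_match = rep(FALSE, 6),
  is_synonym = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  number_matches = rep(1, 6),
  stringsAsFactors = FALSE
)

# Flag rows that are fuzzy matches, synonyms, or have more than one candidate
needs_review <- with(matches_demo,
                     approximate_match | is_synonym | number_matches > 1)

# Show only the rows that deserve a closer look
matches_demo[needs_review, c("search_string", "unique_name")]
```

Here the only flagged row is the synonym discussed above.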
In this case we have a good ID for each of our species, so we can move on. Before we do that, let’s ensure we can match up our original data to the Open Tree names and IDs by adding them to our data.frame:
mu$ott_name <- taxon_search$unique_name
mu$ott_id <- taxon_search$ott_id
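The assignment above relies on the search results being in the same row order as the original data. If you would rather not depend on that, a minimal sketch of an explicit lookup is shown below. It assumes, as the table above suggests, that search_string holds the lower-cased input name; mu_demo and search_demo are hypothetical stand-ins for mu and taxon_search, deliberately given in different orders.

```r
# Hypothetical two-row stand-ins for `mu` and `taxon_search`,
# with the rows in different orders to show the lookup at work.
mu_demo <- data.frame(
  species = c("Tetrahymena thermophila", "Saccharomyces pombe"),
  stringsAsFactors = FALSE
)
search_demo <- data.frame(
  search_string = c("saccharomyces pombe", "tetrahymena thermophila"),
  unique_name = c("Schizosaccharomyces pombe", "Tetrahymena thermophila"),
  ott_id = c(990004, 180195),
  stringsAsFactors = FALSE
)

# Look up each species by its lower-cased name rather than by row position
idx <- match(tolower(mu_demo$species), search_demo$search_string)
stopifnot(!anyNA(idx))  # fail early if any species has no match

mu_demo$ott_name <- search_demo$unique_name[idx]
mu_demo$ott_id <- search_demo$ott_id[idx]
mu_demo
```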
Now let’s find a tree. There are two possible options here: we can search for published studies that include our taxa or we can use the ‘synthetic tree’ from Open Tree. We can try both approaches.
Before we can search for published studies or trees, we should check out the list of properties we can use to perform such searches:
studies_properties()
## $tree_properties
## [1] "ot:treebaseOTUId" "ot:nodeLabelMode"
## [3] "ot:originalLabel" "oti_tree_id"
## [5] "ot:ottTaxonName" "ot:inferenceMethod"
## [7] "ot:tag" "ot:treebaseTreeId"
## [9] "ot:comment" "ot:branchLengthDescription"
## [11] "ot:treeModified" "ot:studyId"
## [13] "ot:branchLengthTimeUnits" "ot:ottId"
## [15] "is_deprecated" "ot:branchLengthMode"
## [17] "ot:treeLastEdited" "ot:nodeLabelDescription"
##
## $study_properties
## [1] "ot:studyModified" "ot:focalClade"
## [3] "ot:focalCladeOTTTaxonName" "ot:focalCladeOTTId"
## [5] "ot:studyPublication" "ot:studyLastEditor"
## [7] "ot:tag" "ot:focalCladeTaxonName"
## [9] "ot:studyLabel" "ot:comment"
## [11] "ot:authorContributed" "ot:studyPublicationReference"
## [13] "ot:curatorName" "ot:studyId"
## [15] "ot:studyUploaded" "ot:studyYear"
## [17] "is_deprecated" "ot:dataDeposit"
We have ottIds for our taxa, so let’s use those IDs to search for trees that contain them. Starting with our first species, Tetrahymena thermophila, we can use studies_find_trees to do this search.
studies_find_trees(property="ot:ottId", value="180195")
## List of Open Tree studies with 0 hits
Well… that’s not very promising. We can repeat that process for all of the IDs to see if the other species are better represented.
hits <- sapply(mu$ott_id, studies_find_trees, property="ot:ottId")
sapply(hits, length)
## 180195.matched_studies 568130.matched_studies 33153.matched_studies
## 0 0 0
## 160850.matched_studies 908549.matched_studies 990004.matched_studies
## 0 22 3
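The names in that output are IDs rather than species, so it can help to map the counts back to taxon names. The sketch below does this with base R only; the counts and names are copied by hand from the output and table above, and hit_counts, ott_names, and hit_summary are hypothetical names for this illustration.

```r
# Hit counts copied from the output above, keyed by OTT ID
hit_counts <- c("180195" = 0, "568130" = 0, "33153" = 0,
                "160850" = 0, "908549" = 22, "990004" = 3)

# Open Tree names for the same IDs (from the name-matching table)
ott_names <- c("180195" = "Tetrahymena thermophila",
               "568130" = "Paramecium tetraurelia",
               "33153"  = "Chlamydomonas reinhardtii",
               "160850" = "Dictyostelium discoideum",
               "908549" = "Saccharomyces cerevisiae",
               "990004" = "Schizosaccharomyces pombe")

# Keep only the taxa with at least one hit and label them by name
with_hits <- hit_counts[hit_counts > 0]
hit_summary <- data.frame(
  ott_name = unname(ott_names[names(with_hits)]),
  n_hits = unname(with_hits),
  stringsAsFactors = FALSE
)
hit_summary
```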
OK, most of our species are not in any of the published trees available. You can help fix this sort of problem by making sure you submit your published trees to Open Tree.
Thankfully, we can still use the complete Tree of Life made from the combined results of all of the published trees and taxonomies that go into Open Tree. The function tol_induced_subtree will fetch a tree relating a set of IDs.
Using the default arguments you can get a tree object into your R session:
tr <- tol_induced_subtree(ott_ids=mu$ott_id)
plot(tr)
Now that we have a tree for our species, how can we use the tree and the data together?
The package phylobase provides an object class called phylo4d, which is designed to represent a phylogeny and data associated with its tips. In order to get our tree and data into one of these objects, we have to make sure the labels in the tree and in our data match exactly. That’s not quite the case at the moment (tree labels have underscores and IDs appended):
mu$ott_name[1]
## [1] "Tetrahymena thermophila"
tr$tip.label[4]
## [1] "Tetrahymena_thermophila_ott180195"
We can use sub to remove the underscores and ottId from the tree labels (check out ?regex to see how these patterns work) and %in% to confirm that each of the modified labels matches a taxon in our data.frame:
tr$tip.label <- sub("_ott\\d+", "", tr$tip.label)
tr$tip.label <- sub("_", " ", tr$tip.label)
tr$tip.label %in% mu$ott_name
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
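Note that sub replaces only the first match in each string, which is enough for two-word species names. A slightly more defensive sketch is shown below: it anchors the ID pattern to the end of the label and uses gsub so that every underscore is replaced. The second label is a made-up placeholder (not a real Open Tree label) included only to show a name with more than two parts.

```r
# First label is from the output above; the second is a hypothetical
# placeholder with an extra name part and a made-up ID.
labels <- c("Tetrahymena_thermophila_ott180195",
            "Genus_species_subspecies_ott123")

# Remove the trailing "_ott<digits>" suffix (anchored with $)
clean <- sub("_ott\\d+$", "", labels)

# Replace every underscore, not just the first one
clean <- gsub("_", " ", clean, fixed = TRUE)
clean
```

With plain sub for the second step, the placeholder label would keep its second underscore.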
OK, now that the tip labels match our data, we can make a new dataset. The phylo4d() function matches tip labels to the row names of a data.frame, so let’s make a new dataset that contains just the relevant data and has row names that match the tree:
library(phylobase)
## Loading required package: grid
mu_numeric <- mu[,c("mu", "pop.size", "genome.size")]
rownames(mu_numeric) <- mu$ott_name
tree_data <- phylo4d(tr, mu_numeric)
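Because the match is by name, it can be worth listing any mismatches in both directions before building the combined object. The base R sketch below uses small hypothetical vectors (tip_labels and trait_names) in place of tr$tip.label and the row names of the data; it also shows what would go wrong if the original species column were used instead of the Open Tree name.

```r
# Hypothetical stand-ins for tr$tip.label and rownames(mu_numeric).
# The trait names deliberately use the old synonym to create a mismatch.
tip_labels <- c("Tetrahymena thermophila", "Paramecium tetraurelia",
                "Schizosaccharomyces pombe")
trait_names <- c("Tetrahymena thermophila", "Paramecium tetraurelia",
                 "Saccharomyces pombe")

# Tips with no row of data, and rows of data with no tip
missing_from_data <- setdiff(tip_labels, trait_names)
missing_from_tree <- setdiff(trait_names, tip_labels)

missing_from_data
missing_from_tree
```

Both vectors should be empty before the tree and the data are combined.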
And now we can plot the data and the tree together
plot(tree_data)
This demonstration gets you to the point of visualizing your data in a phylogenetic context, but there’s a lot more you can do with this sort of data in R. For instance, you could use packages like ape, caper, phytools and MCMCglmm to perform phylogenetic comparative analyses of your data. You could gather more data on your species using packages that connect to trait databases, like rfishbase, AntWeb, or rnpn, which provides data from the US National Phenology Network. You could also use rentrez to find genetic data for each of your species, and use that data to generate branch lengths for the phylogeny.