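Here, we introduce the R-package POUMM, an implementation of the Phylogenetic Ornstein-Uhlenbeck Mixed Model (POUMM) for univariate continuous traits (Mitov and Stadler 2017). Whenever presented with data consisting of a rooted phylogenetic tree with observed trait-values at its tips, the POUMM package can be used to answer questions such as: Is the POUMM an appropriate model for the data? What are the long-term mean and the selection strength of the trait? What proportion of the observed phenotypic variance is attributable to the phylogeny?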
In the first two sections, we demonstrate how the package works. To that end, we run a toy-simulation of a trait according to the POUMM model. Then, we execute a maximum likelihood (ML) and a Bayesian (MCMC) POUMM fit to the simulated data. We show how to use plots and some diagnostics to assess the quality of the fit, i.e. the mixing and the convergence of the MCMC, as well as the consistency of the POUMM fit with the true POUMM parameters from the simulation. In the third section we use variants of the toy-simulation to show how the POUMM can be used to answer each of the questions stated above.
But before we start, we install the needed packages:
install.packages("POUMM")
install.packages("TreeSim")
install.packages("data.table")
install.packages("ggplot2")
install.packages("lmtest")
First, we specify the parameters of the POUMM simulation:
N <- 500
g0 <- 0
alpha <- .5
theta <- 2
sigma <- 0.2
sigmae <- 0.2
We briefly explain the above parameters. N is the number of tips in the simulated tree. The four parameters \(g_0\), \(\alpha\), \(\theta\) and \(\sigma\) define an OU-process with initial state \(g_0\), selection strength \(\alpha\), long-term mean \(\theta\) and unit-time standard deviation \(\sigma\); \(\sigma_e\) (sigmae) is the standard deviation of the non-heritable component described below. To get an intuition about the OU-parameters, one can consider random OU-trajectories generated with the function POUMM::rTrajectoryOU. In the figure below, notice that doubling \(\alpha\) speeds up the convergence of the trajectory towards \(\theta\) (magenta line), while doubling \(\sigma\) results in bigger stochastic oscillations (blue line):
Dashed black and magenta lines denote the deterministic trend towards the long-term mean \(\theta\), fixing the stochastic parameter \(\sigma=0\).
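For readers who want to experiment without the package, the following is a minimal base-R sketch of an OU trajectory. It assumes the standard exact transition of the OU process over a time step dt, namely a normal distribution with mean \(\theta + (g - \theta)e^{-\alpha\,dt}\) and variance \(\sigma^2(1-e^{-2\alpha\,dt})/(2\alpha)\). The helper simulateOU is hypothetical and is not a function of the POUMM package; it is not meant to reproduce POUMM::rTrajectoryOU exactly.

```r
# Hypothetical helper (not part of the POUMM package): simulate an OU
# trajectory on a regular time grid using the exact transition distribution.
simulateOU <- function(g0, nSteps, dt, alpha, theta, sigma) {
  decay <- exp(-alpha * dt)
  # Standard deviation of a single step; alpha = 0 corresponds to Brownian motion.
  stepSD <- if (alpha > 0) {
    sigma * sqrt((1 - decay^2) / (2 * alpha))
  } else {
    sigma * sqrt(dt)
  }
  g <- numeric(nSteps + 1)
  g[1] <- g0
  for (i in seq_len(nSteps)) {
    g[i + 1] <- theta + (g[i] - theta) * decay + rnorm(1, mean = 0, sd = stepSD)
  }
  data.frame(t = (0:nSteps) * dt, g = g)
}

set.seed(1)
# Same parameter values as in the toy simulation.
traj <- simulateOU(g0 = 0, nSteps = 1000, dt = 0.01,
                   alpha = 0.5, theta = 2, sigma = 0.2)
# Deterministic trend: the same process with sigma = 0.
trend <- simulateOU(g0 = 0, nSteps = 1000, dt = 0.01,
                    alpha = 0.5, theta = 2, sigma = 0)
```

The two data frames can be passed to plot() and lines() to compare a stochastic trajectory with its deterministic trend.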
The POUMM models the evolution of a continuous trait, \(z\), along a phylogenetic tree, assuming that \(z\) is the sum of a genetic (heritable) component, \(g\), and an independent non-heritable (environmental) component, \(e\sim N(0,\sigma_e^2)\). At every branching in the tree, the daughter lineages inherit the \(g\)-value of their parent, adding their own environmental component \(e\). The POUMM assumes the genetic component, \(g\), evolves along each lineage according to an OU-process with initial state the \(g\) value inherited from the parent-lineage and global parameters \(\alpha\), \(\theta\) and \(\sigma\).
Once the POUMM parameters are specified, we use the TreeSim R-package (Stadler 2015) to generate a random birth-death tree with 500 tips:
# simulate a birth-death tree with N tips
tree <- TreeSim::sim.bdsky.stt(N, lambdasky = 1.6, deathsky = .6,
timesky=c(0, Inf), sampprobsky = 1)[[1]]
Starting from the root value \(g_0\), we simulate the genotypic values, \(g\), and the environmental contributions, \(e\), at all internal nodes down to the tips of the phylogeny:
# genotypic (heritable) values
g <- POUMM::rVNodesGivenTreePOUMM(tree, g0, alpha, theta, sigma)
# environmental contributions
e <- rnorm(length(g), 0, sigmae)
# phenotypic values
z <- g + e
In most real situations, only the phenotypic values at the tips, i.e. z[1:N], will be observable. One useful way to visualize the observed trait-values is to cluster the tips in the tree according to their root-tip distance, and to use box-whisker or violin plots to visualize the trait distribution in each group. This makes it possible to visually assess the trend towards uni-modality and normality of the values, an important prerequisite for the POUMM.
# Group the tips by root-tip distance, using the nodeTimes utility function
# in combination with the cut function from base R.
library(data.table)
library(ggplot2)
data <- data.table(z = z[1:N], t = POUMM::nodeTimes(tree, tipsOnly = TRUE))
data[, group := cut(t, breaks = 5, include.lowest = TRUE)]
ggplot(data = data, aes(x = t, y = z, group = group)) +
  geom_violin(aes(col = group)) + geom_point(aes(col = group), size = .5)
Distributions of the trait-values grouped according to their root-tip distances.
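The visual check can be complemented with a per-group numeric summary. The sketch below uses base R only and made-up stand-in data; the vectors tipTimes and traitValues are assumptions for illustration and do not come from the simulation above.

```r
set.seed(42)
# Stand-in data (assumption): 200 tips with made-up root-tip distances
# and trait values.
tipTimes    <- runif(200, min = 1, max = 6)
traitValues <- rnorm(200, mean = 2, sd = 0.3)

# Split the range of root-tip distances into 5 equal-width intervals.
group <- cut(tipTimes, breaks = 5, include.lowest = TRUE)

# Number of tips, mean and standard deviation of the trait in each interval.
groupSummary <- data.frame(
  group = levels(group),
  n     = as.vector(table(group)),
  mean  = as.vector(tapply(traitValues, group, mean)),
  sd    = as.vector(tapply(traitValues, group, sd))
)
print(groupSummary)
```

Group means that drift with the root-tip distance, or strongly unequal standard deviations, are hints to look more closely at the corresponding violin plots.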
Once all simulated data is available, it is time to proceed with a first POUMM fit. This is done easily by calling the POUMM function:
fitPOUMM <- POUMM::POUMM(z[1:N], tree)
The above code runs for about 5 minutes on a MacBook Pro Retina (late 2013) with a 2.3 GHz Intel Core i7 processor. Using default settings, it performs a maximum likelihood (ML) fit followed by a Bayesian (MCMC) fit to the data. Three MCMC chains are run: the first chain samples from the default prior distribution, i.e. it assumes a constant POUMM likelihood; the second and third chains perform adaptive Metropolis sampling from the posterior parameter distribution conditioned on the default prior and the data. By default each chain is run for \(10^5\) iterations. This and other default POUMM settings are described in detail on the help page of the function specifyPOUMM (see ?POUMM::specifyPOUMM).
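Executing three MCMC chains instead of one makes it possible to assess two things: the convergence of the MCMC, by comparing the two chains that sample from the posterior (chains 2 and 3), and the information about the POUMM parameters contained in the data, by comparing these chains with the chain that samples from the prior (chain 1).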
We plot traces and posterior sample densities from the MCMC fit:
# get a list of plots
plotList <- plot(fitPOUMM, showUnivarDensityOnDiag = TRUE, doPlot = FALSE)
plotList$traceplot
MCMC traces from a POUMM MCMC-fit.
plotList$densplot
MCMC univariate density plots. Black dots on the x-axis indicate the ML-fit.
A mismatch of the posterior sample density plots from chains 2 and 3, in particular for the phylogenetic heritability, \(H_{\bar{t}}^2\), indicates that the chains have not converged. This can be confirmed quantitatively by the Gelman-Rubin statistic (column called G.R.) in the summary of the fit:
summary(fitPOUMM)
## stat N MLE PostMean HPD ESS G.R.
## 1: alpha 500 0.47443 0.45826 0.3057,0.6263 126.79 1.085
## 2: theta 500 2.04244 2.06446 1.955,2.259 102.41 1.034
## 3: sigma 500 0.20051 0.19904 0.1415,0.2600 85.15 1.140
## 4: sigmae 500 0.20522 0.20398 0.1733,0.2328 173.90 1.162
## 5: H2e 500 0.64112 0.64243 0.5381,0.7441 173.61 1.129
## 6: H2tInf 500 0.50151 0.50735 0.3338,0.6645 181.48 1.120
## 7: H2tMax 500 0.50067 0.50583 0.3327,0.6623 179.65 1.119
## 8: H2tMean 500 0.49919 0.50378 0.3310,0.6599 177.93 1.118
## 9: sigmaG2tMean 500 0.04198 0.04328 0.02443,0.06436 133.52 1.107
## 10: sigmaG2tMax 500 0.04223 0.04363 0.02484,0.06570 134.14 1.106
## 11: sigmaG2tInf 500 0.04237 0.04389 0.02527,0.06710 134.86 1.106
## 12: logpost 500 NA -55.55686 -58.45,-53.52 47.93 1.226
## 13: loglik 500 -53.29615 NA NA,NA 0.00 NA
## 14: AIC 500 116.59230 NA NA,NA 0.00 NA
## 15: AICc 500 116.71376 NA NA,NA 0.00 NA
## 16: g0 500 0.03895 NA NA,NA 0.00 NA
The G.R. diagnostic is used to check whether two random samples originate from the same distribution. Values that are substantially different from 1.00 (in this case, greater than 1.01) indicate a significant difference between the two samples and a possible need to increase the number of MCMC iterations. Therefore, we rerun the fit, specifying that each chain should be run for \(4 \times 10^5\) iterations:
fitPOUMM2 <- POUMM::POUMM(z[1:N], tree, spec=list(nSamplesMCMC = 4e5))
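To build intuition for the G.R. column, the basic form of the Gelman-Rubin potential scale reduction factor can be computed by hand. The helper potentialScaleReduction below is hypothetical and implements only the textbook formula; the values reported in the summary come from the package and may include further corrections, so the numbers need not match exactly.

```r
# Hypothetical helper: basic potential scale reduction factor for one parameter.
# `chains` is a numeric matrix with one column per chain and one row per sample.
potentialScaleReduction <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # mean within-chain variance
  B <- n * var(colMeans(chains))      # between-chain variance
  varPlus <- (n - 1) / n * W + B / n  # pooled estimate of the target variance
  sqrt(varPlus / W)
}

set.seed(1)
# Two chains drawn from the same distribution: the factor is close to 1.
sameTarget <- cbind(rnorm(1000), rnorm(1000))
# Two chains centered at different values: the factor is clearly above 1.
shifted <- cbind(rnorm(1000), rnorm(1000, mean = 1))

rSame    <- potentialScaleReduction(sameTarget)
rShifted <- potentialScaleReduction(shifted)
```

Because the within-chain variance is weighted by (n - 1) / n, the factor can fall slightly below 1 for well-mixed chains, as in some rows of the summary of the longer run.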
Now, both the density plots and the G.R. values indicate nearly perfect convergence of the second and third chains. The agreement between the ML-estimates (black dots on the density plots) and the posterior density modes (approximate location of the peak in the density curves) shows that the prior does not introduce a bias in the MCMC sample. The mismatch between chain 1 and chains 2 and 3 suggests that the information about the POUMM parameters contained in the data disagrees with or significantly improves our prior knowledge about these parameters. This is the desired outcome of a Bayesian fit, in particular in the case of a weak (non-informed) prior, such as the default one.
plotList <- plot(fitPOUMM2, doPlot = FALSE)
plotList$densplot
summary(fitPOUMM2)
## stat N MLE PostMean HPD ESS G.R.
## 1: alpha 500 0.47443 0.46565 0.3016,0.6189 720.0 0.9992
## 2: theta 500 2.04244 2.05998 1.926,2.197 720.0 0.9993
## 3: sigma 500 0.20051 0.19761 0.1486,0.2682 720.0 0.9997
## 4: sigmae 500 0.20522 0.20639 0.1763,0.2401 841.1 0.9996
## 5: H2e 500 0.64112 0.63443 0.5249,0.7493 941.2 0.9996
## 6: H2tInf 500 0.50151 0.49558 0.3152,0.6729 571.3 0.9996
## 7: H2tMax 500 0.50067 0.49411 0.3135,0.6700 575.2 0.9996
## 8: H2tMean 500 0.49919 0.49218 0.3117,0.6692 578.3 0.9996
## 9: sigmaG2tMean 500 0.04198 0.04236 0.02292,0.06286 548.3 0.9989
## 10: sigmaG2tMax 500 0.04223 0.04270 0.02337,0.06365 543.9 0.9989
## 11: sigmaG2tInf 500 0.04237 0.04297 0.02357,0.06392 538.5 0.9989
## 12: logpost 500 NA -55.63587 -58.71,-53.61 720.0 1.0060
## 13: loglik 500 -53.29615 NA NA,NA 0.0 NA
## 14: AIC 500 116.59230 NA NA,NA 0.0 NA
## 15: AICc 500 116.71376 NA NA,NA 0.0 NA
## 16: g0 500 0.03895 NA NA,NA 0.0 NA
The 95% highest posterior density (HPD) intervals contain the true values of the POUMM parameters \(\alpha\), \(\theta\), \(\sigma\) and \(\sigma_e\). For \(g_0\), the summary reports only an ML estimate (0.039), which is close to the true value of 0. The true values of the derived statistics also fall within their HPD intervals. To check this, we calculate the derived statistics from the true parameter values and compare them with the corresponding HPD intervals:
tMean <- mean(POUMM::nodeTimes(tree, tipsOnly = TRUE))
tMax <- max(POUMM::nodeTimes(tree, tipsOnly = TRUE))
c(# phylogenetic heritability at mean root-tip distance:
H2tMean = POUMM::H2(alpha, sigma, sigmae, t = tMean),
# phylogenetic heritability at long-term equilibrium:
H2tInf = POUMM::H2(alpha, sigma, sigmae, t = Inf),
# empirical (time-independent) phylogenetic heritability,
H2e = POUMM::H2e(z[1:N], sigmae),
# genotypic variance at mean root-tip distance:
sigmaG2tMean = POUMM::varOU(t = tMean, alpha, sigma),
# genotypic variance at max root-tip distance:
sigmaG2tMax = POUMM::varOU(t = tMax, alpha, sigma),
# genotypic variance at long-term equilibrium:
sigmaG2tInf = POUMM::varOU(t = Inf, alpha, sigma)
)
## H2tMean H2tInf H2e sigmaG2tMean sigmaG2tMax
## 0.49820 0.50000 0.65914 0.03971 0.03990
## sigmaG2tInf
## 0.04000
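The long-term values in this output can be cross-checked by hand. The sketch below assumes the standard OU variance formula \(\sigma^2(1-e^{-2\alpha t})/(2\alpha)\) and defines the phylogenetic heritability at time \(t\) as the ratio of this genotypic variance to the total variance. The helpers ouVariance and phyloHeritability are hypothetical stand-ins and are not the package functions POUMM::varOU and POUMM::H2.

```r
# Hypothetical helpers, assuming the standard OU variance formula.
ouVariance <- function(t, alpha, sigma) {
  if (alpha == 0) return(sigma^2 * t)  # Brownian motion limit
  sigma^2 * (1 - exp(-2 * alpha * t)) / (2 * alpha)
}

# Genotypic variance divided by total (genotypic + environmental) variance.
phyloHeritability <- function(t, alpha, sigma, sigmae) {
  vG <- ouVariance(t, alpha, sigma)
  vG / (vG + sigmae^2)
}

# True parameter values of the toy simulation.
alpha  <- 0.5
sigma  <- 0.2
sigmae <- 0.2

sigmaG2tInf <- ouVariance(Inf, alpha, sigma)                 # 0.04
H2tInf      <- phyloHeritability(Inf, alpha, sigma, sigmae)  # 0.5
```

Both values agree with the sigmaG2tInf and H2tInf entries printed above.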
Finally, we compare the ratio of empirical genotypic to total phenotypic variance with the HPD-interval for the phylogenetic heritability.
c(H2empirical = var(g[1:N])/var(z[1:N]))
## H2empirical
## 0.6791
summary(fitPOUMM2)["H2e"==stat, unlist(HPD)]
## lower upper
## 0.5249 0.7493
On multi-core systems, it is possible to speed-up the POUMM-fit by parallelization. The POUMM package supports parallelization on two levels:
The first level is the parallel execution of the MCMC chains on a cluster created with the R-package parallel. With the default settings of the MCMC-fit (executing two MCMC chains sampling from the posterior distribution and one MCMC chain sampling from the prior), this parallelization can result in about a two-fold speed-up of the POUMM fit on a computer with at least two available physical cores. Unless you wish to run more parallel MCMC chains, having more available physical cores would not improve the speed.
# set up a parallel cluster on the local computer for parallel MCMC:
cluster <- parallel::makeCluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cluster)
fitPOUMM <- POUMM::POUMM(z[1:N], tree, spec=list(parallelMCMC = TRUE))
# Don't forget to destroy the parallel cluster to avoid leaving zombie worker-processes.
parallel::stopCluster(cluster)
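The second level is OpenMP multi-threading within a single POUMM fit. This depends on the compiler used to build the package; an example Makevars file, found in the directory .R under the user’s home directory, is:
CFLAGS += -O3 -Wall -pipe -pedantic -std=gnu99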
CXXFLAGS += -O3 -Wall -pipe -Wno-unused -pedantic
FC=gfortran
F77=gfortran
MAKE=make -j8
CPP=cpp
CXX=icpc
CC=icc
SHLIB_CXXLD=icpc
Then, before starting R, we can define the maximum number of cores (defaults to all physical cores on the system) by specifying the environment variable OMP_NUM_THREADS, e.g.:
export OMP_NUM_THREADS=4
To decide whether the POUMM is an appropriate model for the data, the first step is to visualize the data and check for obvious violations of the POUMM assumptions. The POUMM method expects that the trait-values at the tips are a sample from a multivariate normal distribution. With an ultrametric species tree, where all tips are equally distant from the root, this assumption translates into all trait-values being realizations of identically distributed normal random variables. In the case of a non-ultrametric tree, it is far more useful to look at a sequence of box-whisker or violin plots of the trait-values, grouped by their root-tip distance.
Once visualizing the data has confirmed its normality, we recommend comparing the POUMM-fit with a fit from a null model such as the phylogenetic mixed model (PMM) (Housworth, Martins, and Lynch 2004). The PMM is nested in the POUMM: in the limit \(\alpha\to0\), the POUMM is equivalent to a PMM with the same initial genotypic value \(g_0\) and unit-time standard deviation \(\sigma\). It is therefore easy to fit a PMM to the data by fixing the value of the parameter \(\alpha\) to 0:
specPMM <- POUMM::specifyPMM(z[1:N], tree)
fitPMM <- POUMM::POUMM(z[1:N], tree, spec = specPMM, doMCMC=FALSE)
Now a likelihood-ratio test between the maximum likelihood fits clearly shows that the POUMM fits the data significantly better:
lmtest::lrtest(fitPMM, fitPOUMM2)
## Likelihood ratio test
##
## Model 1: fitPMM
## Model 2: fitPOUMM2
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -112.7
## 2 5 -53.3 2 119 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since lrtest only uses the ML-fit, we disabled the MCMC fit by specifying doMCMC = FALSE to save time. In real situations, though, it is always recommended to enable the MCMC fit, since it can improve the ML-fit if it finds a region of higher likelihood in the parameter space that has not been discovered by the ML-fit.
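The numbers in the test output can be reproduced by hand from the two log-likelihoods. The sketch below copies the rounded log-likelihoods and the numbers of parameters from the printed output of the first test (PMM versus POUMM on the simulated data) and uses only base R.

```r
# Rounded values copied from the printed likelihood ratio test.
logLikPMM   <- -112.7  # PMM fit, 3 free parameters
logLikPOUMM <- -53.3   # POUMM fit, 5 free parameters
dfDiff <- 5 - 3

# Test statistic: twice the difference of the maximized log-likelihoods.
chisq <- 2 * (logLikPOUMM - logLikPMM)

# Upper-tail probability of a chi-squared distribution with dfDiff
# degrees of freedom.
pValue <- pchisq(chisq, df = dfDiff, lower.tail = FALSE)
print(c(chisq = chisq, pValue = pValue))
```

The statistic rounds to the value of 119 shown in the Chisq column, and the p-value is far below the 2e-16 threshold that the output prints.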
As an exercise, we can generate data under the PMM model and see if a POUMM fit on that data remains significantly better than a PMM fit:
gBM <- POUMM::rVNodesGivenTreePOUMM(tree, g0, alpha = 0, theta = 0, sigma = sigma)
zBM <- gBM + e
fitPMM_on_zBM <- POUMM::POUMM(zBM[1:N], tree, spec = specPMM, doMCMC = FALSE)
fitPOUMM_on_zBM <- POUMM::POUMM(zBM[1:N], tree, doMCMC = FALSE)
lmtest::lrtest(fitPMM_on_zBM, fitPOUMM_on_zBM)
## Likelihood ratio test
##
## Model 1: fitPMM_on_zBM
## Model 2: fitPOUMM_on_zBM
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -109
## 2 5 -108 2 1.12 0.57
To characterize the long-term mean of the trait and the strength of selection towards it, consider the estimated values of the POUMM-parameters \(\theta\) and \(\alpha\). Note that the parameter \(\theta\) is relevant only if the value of the parameter \(\alpha\) is significantly positive. One could accept that the ML-estimate for \(\alpha\) is significantly positive if a likelihood ratio test between the ML fits of the PMM and the POUMM gives a p-value below a critical level (see the likelihood ratio tests above for an example). An insignificant value of \(\alpha\) reveals that the hypothesis of neutral drift (Brownian motion) cannot be rejected.
In other words, what is the proportion of observable phenotypic variance attributable to the phylogeny? To answer this question, the POUMM package makes it possible to estimate the phylogenetic heritability of the trait. Assuming that the tree represents the genetic relationship between individuals in a population, \(H_{\bar{t}}^2\) provides an estimate for the broad-sense heritability \(H^2\) of the trait in the population. The POUMM package reports the following types of phylogenetic heritability (see table for simplified expressions): the heritability at the mean root-tip distance (H2tMean), at the maximum root-tip distance (H2tMax) and at the long-term equilibrium (H2tInf), as well as the empirical, time-independent heritability (H2e).
When the goal is to estimate \(H_{\bar{t}}^2\) (H2tMean), it is important to specify an uninformed prior for it. Looking at the densities for chain 1 (red) on the previous figures, it becomes clear that the default prior favors values of H2tMean that are either close to 0 or close to 1. Since by definition \(H_{\bar{t}}^2\in[0,1]\), a reasonable uninformed prior for it is the standard uniform distribution. We set this prior by using the POUMM::specifyPOUMM_ATH2tMeanSeG0 function. This specifies that the POUMM fit should be done on the parametrization \(\langle\alpha,\theta,H_{\bar{t}}^2,\sigma_e,g_0\rangle\) rather than the standard parametrization \(\langle\alpha,\theta,\sigma,\sigma_e,g_0\rangle\). It also specifies a uniform prior for \(H_{\bar{t}}^2\). You can explore the members of the specification list to see the different settings:
specH2tMean <- POUMM::specifyPOUMM_ATH2tMeanSeG0(z[1:N], tree, nSamplesMCMC = 4e5)
# Mapping from the sampled parameters to the standard POUMM parameters:
specH2tMean$parMapping
## function (par)
## {
## if (is.matrix(par)) {
## par[, 3] <- POUMM::sigmaOU(par[, 3], par[, 1], par[,
## 4], tMean)
## colnames(par) <- c("alpha", "theta", "sigma", "sigmae",
## "g0")
## }
## else {
## par[3] <- POUMM::sigmaOU(par[3], par[1], par[4], tMean)
## names(par) <- c("alpha", "theta", "sigma", "sigmae",
## "g0")
## }
## par
## }
## <bytecode: 0x7ffc602a7990>
## <environment: 0x7ffc602aec30>
# Prior for the MCMC sampling
specH2tMean$parPriorMCMC
## function (par)
## {
## dexp(par[1], rate = tMean/6.931, log = TRUE) + dnorm(par[2],
## zMean, 2 * zSD, TRUE) + dunif(par[3], min = 0, max = 1,
## log = TRUE) + dexp(par[4], rate = 2/zSD, log = TRUE) +
## dnorm(par[5], zMean, 2 * zSD, log = TRUE)
## }
## <bytecode: 0x7ffc602ab698>
## <environment: 0x7ffc602aec30>
# Bounds for the maximum likelihood search
specH2tMean$parLower
## alpha theta H2tMean sigmae g0
## 0.0000 -4.6686 0.0000 0.0000 -0.3944
specH2tMean$parUpper
## alpha theta H2tMean sigmae g0
## 14.0440 7.7297 0.9900 0.6851 3.4555
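The parameter mapping printed above converts a sampled heritability back into \(\sigma\). A minimal sketch of such a conversion is given below, assuming the standard OU variance formula and the definition of heritability as genotypic over total variance. The helpers sigmaFromH2 and h2FromSigma are hypothetical and are not the package function POUMM::sigmaOU, although the argument order (heritability, \(\alpha\), \(\sigma_e\), time) follows the call shown in the mapping.

```r
# Hypothetical helper: heritability at time t for given OU parameters
# (requires alpha > 0).
h2FromSigma <- function(sigma, alpha, sigmae, t) {
  vG <- sigma^2 * (1 - exp(-2 * alpha * t)) / (2 * alpha)
  vG / (vG + sigmae^2)
}

# Hypothetical helper: invert the relation above to recover sigma
# (requires alpha > 0 and 0 <= H2 < 1).
sigmaFromH2 <- function(H2, alpha, sigmae, t) {
  vG <- H2 * sigmae^2 / (1 - H2)  # genotypic variance implied by H2
  sqrt(2 * alpha * vG / (1 - exp(-2 * alpha * t)))
}

# At long-term equilibrium, H2 = 0.5 with alpha = 0.5 and sigmae = 0.2
# corresponds to sigma = 0.2, the value used in the toy simulation.
sigmaEquilibrium <- sigmaFromH2(H2 = 0.5, alpha = 0.5, sigmae = 0.2, t = Inf)

# Round trip at a finite time (t = 5 is an arbitrary illustrative value).
h2AtT     <- h2FromSigma(sigma = 0.2, alpha = 0.5, sigmae = 0.2, t = 5)
sigmaBack <- sigmaFromH2(h2AtT, alpha = 0.5, sigmae = 0.2, t = 5)
```

The inverse is undefined at a heritability of exactly 1, since the implied genotypic variance would be infinite.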
Then we fit the model:
fitH2tMean <- POUMM::POUMM(z[1:N], tree, spec = specH2tMean)
plot(fitH2tMean, stat = c("H2tMean", "H2e", "H2tInf", "sigmae"),
showUnivarDensityOnDiag = TRUE,
doZoomIn = TRUE, doPlot = TRUE)
summary(fitH2tMean)[stat %in% c("H2tMean", "H2e", "H2tInf", "sigmae")]
## stat N MLE PostMean HPD ESS G.R.
## 1: H2tMean 500 0.4992 0.5061 0.3299,0.6742 720 1.0023
## 2: sigmae 500 0.2052 0.2062 0.1767,0.2380 720 1.0008
## 3: H2e 500 0.6411 0.6356 0.5206,0.7370 720 0.9999
## 4: H2tInf 500 0.5015 0.5104 0.3357,0.6759 720 1.0024
Now we see that the prior density for H2tMean is nearly uniform. It becomes clear that the process has converged to its long-term heritability since the intervals for H2tMean and H2tInf are nearly the same. Notice, though, that the estimate for the empirical heritability H2e is shifted towards 1 compared to H2tMean and H2tInf. This shows an important difference between H2e and the time-dependent formulae for phylogenetic heritability: H2e takes into account all values of z including those at the very beginning when the process was far away from equilibrium. Thus the estimated phenotypic variance over all trait-values at all times can be substantially bigger compared to the current trait-variance in the population:
# Compare global empirical heritability
H2eGlobal <- POUMM::H2e(z[1:N], sigmae = coef(fitH2tMean)['sigmae'])
# versus recent empirical heritability
H2eRecent <- POUMM::H2e(z[1:N], tree, sigmae = coef(fitH2tMean)['sigmae'], tFrom = 5)
print(c(H2eGlobal, H2eRecent))
## [1] 0.6411 0.5011
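The effect of the time window can be illustrated with made-up data. The sketch below assumes that the empirical heritability is computed as \(1-\sigma_e^2/\mathrm{var}(z)\) over the tips whose root-tip distance falls in the chosen window; this formula is an assumption about POUMM::H2e, and the helper empiricalH2 together with the stand-in vectors tipTimes and zSim is hypothetical.

```r
# Hypothetical helper, assuming H2e = 1 - sigmae^2 / var(z) over the
# tips whose root-tip distance lies in [tFrom, tTo].
empiricalH2 <- function(z, sigmae, tipTimes = NULL, tFrom = 0, tTo = Inf) {
  if (!is.null(tipTimes)) {
    z <- z[tipTimes >= tFrom & tipTimes <= tTo]
  }
  1 - sigmae^2 / var(z)
}

set.seed(7)
# Stand-in data (assumption): 50 early tips sampled far from the long-term
# mean and 450 recent tips sampled close to it.
tipTimes <- c(runif(50, min = 0, max = 1), runif(450, min = 5, max = 8))
zSim     <- c(rnorm(50, mean = 0.5, sd = 0.3), rnorm(450, mean = 2, sd = 0.3))

h2Global <- empiricalH2(zSim, sigmae = 0.2)
h2Recent <- empiricalH2(zSim, sigmae = 0.2, tipTimes = tipTimes, tFrom = 5)
print(c(global = h2Global, recent = h2Recent))
```

Pooling early and recent tips inflates the phenotypic variance, so the global value lies closer to 1 than the value restricted to recent tips.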
To learn more about different ways to specify the POUMM fit, read the documentation page ?POUMM::specifyPOUMM_ATH2tMeanSeG0.
Apart from base R functionality, the POUMM package uses a number of third-party R-packages, which are included in the references below.
Bates, Douglas, and Martin Maechler. 2017. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.
Dowle, Matt, and Arun Srinivasan. 2016. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Hankin, Robin K. S., Duncan Murdoch (qrng functions), and Andrew Clausen (multimin). 2017. Gsl: Wrapper for the Gnu Scientific Library. https://CRAN.R-project.org/package=gsl.
Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates, and John Chambers. 2017. Rcpp: Seamless R and C++ Integration. https://CRAN.R-project.org/package=Rcpp.
Eddelbuettel, Dirk, Romain Francois, and Doug Bates. 2016. RcppArmadillo: ’Rcpp’ Integration for the ’Armadillo’ Templated Linear Algebra Library. https://CRAN.R-project.org/package=RcppArmadillo.
Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, and Torsten Hothorn. 2016. Mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.
Hothorn, Torsten, Achim Zeileis, Richard W. Farebrother, and Clint Cummins. 2015. Lmtest: Testing Linear Regression Models. https://CRAN.R-project.org/package=lmtest.
Housworth, Elizabeth A, Emília P Martins, and Michael Lynch. 2004. “The phylogenetic mixed model.” The American Naturalist 163 (1): 84–96.
Maechler, Martin. 2016. Rmpfr: R Mpfr - Multiple Precision Floating-Point Reliable. https://CRAN.R-project.org/package=Rmpfr.
Mitov, Venelin, and Tanja Stadler. 2017. “Fast and Robust Inference of Phylogenetic Ornstein-Uhlenbeck Models Using Parallel Likelihood Calculation.” BioRxiv, May, 115089.
Paradis, Emmanuel, Simon Blomberg, Ben Bolker, Julien Claude, Hoa Sien Cuong, Richard Desper, Gilles Didier, et al. 2016. Ape: Analyses of Phylogenetics and Evolution. https://CRAN.R-project.org/package=ape.
Plummer, Martyn, Nicky Best, Kate Cowles, Karen Vines, Deepayan Sarkar, Douglas Bates, Russell Almond, and Arni Magnusson. 2016. Coda: Output Analysis and Diagnostics for Mcmc. https://CRAN.R-project.org/package=coda.
Revolution Analytics, and Steve Weston. n.d. Foreach: Provides Foreach Looping Construct for R.
Scheidegger, Andreas. 2012. AdaptMCMC: Implementation of a Generic Adaptive Monte Carlo Markov Chain Sampler. https://CRAN.R-project.org/package=adaptMCMC.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2016. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Stadler, Tanja. 2015. TreeSim: Simulating Phylogenetic Trees. https://CRAN.R-project.org/package=TreeSim.
Team, R Core. n.d. Support for Parallel Computation in R.
Wickham, Hadley. 2016. Testthat: Unit Testing for R. https://CRAN.R-project.org/package=testthat.
Wickham, Hadley, and Winston Chang. 2016. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.