The myTAI package provides analytics tools for datasets that follow the PhyloExpressionSet or DivergenceExpressionSet standard. A PhyloExpressionSet is the combination of a Phylostratigraphic Map and an expression set; a DivergenceExpressionSet is the combination of a Divergence Map and an expression set.
The computation of a Phylostratigraphic Map relies on a method named Phylostratigraphy. The computation of a Divergence Map relies on a method named Divergence Stratigraphy. Both methods are computationally expensive and build on several methodologies and evolutionary concepts. Nevertheless, the orthologr package aims to automate Divergence Stratigraphy and can be used to obtain a Divergence Map for a query organism of interest.
A more detailed description of Divergence-Stratigraphy can be found in the Divergence-Stratigraphy Vignette that is included in orthologr.
# install orthologr from GitHub
# install.packages("devtools")
# install the current version of orthologr on your system
library(devtools)
install_github("HajkD/orthologr", build_vignettes = TRUE, dependencies = TRUE)
# On Windows, this won't work - see ?build_github_devtools
# When working with Windows, first install Rtools, a separate toolchain
# installer distributed via CRAN (it is not an R package, so
# install.packages("rtools") will not work).
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
devtools::install_github("HajkD/orthologr", build_vignettes = TRUE, dependencies = TRUE)
# and then load it from the library (adjust lib.loc to your own R installation)
library("orthologr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
A divergence map quantifies, for each protein-coding gene of a given organism, the degree of selection pressure acting on it. The selection pressure is quantified by dN/dS estimation.
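As a rough illustration of what a dN/dS value expresses, the sketch below computes the ratio for three made-up genes. The gene names and substitution rates are invented for illustration, and this is not how orthologr estimates dN/dS; it only shows the conventional reading of the ratio (below 1: purifying selection, above 1: positive selection).

```r
# Toy substitution rates (invented values, not real estimates)
toy <- data.frame(
  GeneID = c("geneA", "geneB", "geneC"),
  dN = c(0.01, 0.05, 0.30),  # nonsynonymous substitution rate
  dS = c(0.20, 0.05, 0.10),  # synonymous substitution rate
  stringsAsFactors = FALSE
)

# dN/dS ratio per gene
toy$dNdS <- toy$dN / toy$dS

# Conventional interpretation of the ratio; the exact comparison with 1
# is only meaningful for this toy data
toy$regime <- ifelse(toy$dNdS < 1, "purifying",
              ifelse(toy$dNdS > 1, "positive", "neutral"))

toy
```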
To perform divergence stratigraphy using orthologr you need the following prerequisites
In the following example, we will use Arabidopsis thaliana as query organism and Arabidopsis lyrata as subject organism.
First, we need to download the CDS sequences for all protein coding genes of A. thaliana and A. lyrata.
The CDS retrieval can be done using a terminal or by manually downloading the files Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz and Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz.
# download CDS file of A. thaliana
curl ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz -o Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz
# download CDS file of A. lyrata
curl ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/arabidopsis_lyrata/cds/Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz -o Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz
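The same retrieval could also be scripted from within R. The following is a minimal sketch using only base R; the helper names cds_url(), fetch_cds() and gunzip_file() are hypothetical and not part of orthologr or myTAI, and it is an assumption that the release-23 FTP paths above are still reachable. Because the files are gzip-compressed, the sketch also shows one way to decompress them, demonstrated on a small temporary file so that it runs without network access.

```r
# Hypothetical helper: build the CDS URL used in the curl commands above
cds_url <- function(species_dir, file_name) {
  paste0("ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/",
         species_dir, "/cds/", file_name)
}

# Hypothetical helper: download a CDS file unless it is already present
fetch_cds <- function(species_dir, file_name, dest_dir = ".") {
  dest <- file.path(dest_dir, file_name)
  if (!file.exists(dest)) {
    download.file(cds_url(species_dir, file_name), destfile = dest, mode = "wb")
  }
  dest
}

# Hypothetical helper: decompress a .gz file with base R connections
gunzip_file <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  con_in <- gzfile(gz_path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    chunk <- readBin(con_in, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, con_out)
  }
  invisible(out_path)
}

ath_file <- "Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz"
aly_file <- "Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz"
ath_url <- cds_url("arabidopsis_thaliana", ath_file)
aly_url <- cds_url("arabidopsis_lyrata", aly_file)

# Uncomment to download and decompress (requires network access):
# gunzip_file(fetch_cds("arabidopsis_thaliana", ath_file))
# gunzip_file(fetch_cds("arabidopsis_lyrata", aly_file))

# Offline demonstration of the decompression step on a toy FASTA file
tmp_gz <- tempfile(fileext = ".fa.gz")
con <- gzfile(tmp_gz, open = "wt")
writeLines(c(">toy_gene", "ATGGCGTAA"), con)
close(con)
tmp_fa <- gunzip_file(tmp_gz)
readLines(tmp_fa)
```

Skipping the download when the file already exists avoids fetching the same large file twice when the script is re-run.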
Alternatively, you can use the biological data retrieval package biomartr to download proteomes from the RefSeq database (see the Sequence Retrieval Vignette for details).
# install.packages("devtools")
# install the current version of biomartr on your system
library(devtools)
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# On Windows, this won't work - see ?build_github_devtools
# When working with Windows, first install Rtools, a separate toolchain
# installer distributed via CRAN (it is not an R package, so
# install.packages("rtools") will not work).
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# and then load it from the library (adjust lib.loc to your own R installation)
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
Ath_Proteome <- getProteome( db           = "refseq",
                             kingdom      = "plant",
                             organism     = "Arabidopsis thaliana",
                             clean_folder = FALSE )
Internally, the getProteome() function creates a directory named _ncbi_downloads/proteomes in which the corresponding proteomes are stored; they are then loaded as a data.table object into the current R session. When clean_folder = FALSE is specified, the _ncbi_downloads/proteomes folder is not removed, so the corresponding proteome does not need to be downloaded again the next time it is loaded into an R session via getProteome().
When the download is finished, unzip the files and then start R to perform the following analyses:
library(orthologr)
# compute the divergence map of A. thaliana
Athaliana_DM <- divergence_stratigraphy(
    query_file      = "path/to/Arabidopsis_thaliana.TAIR10.23.cds.all.fa",
    subject_file    = "path/to/Arabidopsis_lyrata.v.1.0.23.cds.all.fa",
    eval            = "1E-5",
    ortho_detection = "RBH",
    comp_cores      = 1,
    quiet           = TRUE,
    clean_folders   = TRUE )
Note that you can set the comp_cores argument to a value greater than 1 if you work on a multicore machine.
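One way to choose a value for comp_cores is to ask R how many cores the machine reports. The sketch below uses detectCores() from the parallel package that ships with R; leaving one core free for the rest of the system is a common convention and an assumption here, not a recommendation from orthologr.

```r
library(parallel)

# Number of cores reported by the operating system (may be NA)
available <- detectCores()
if (is.na(available)) available <- 1L

# Keep one core free, but never go below a single core
comp_cores <- max(1L, available - 1L)
comp_cores
```

The resulting value could then be passed as comp_cores = comp_cores in the divergence_stratigraphy() call.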
The next step is to combine the Divergence Map of A. thaliana (Athaliana_DM) with a gene expression set covering a biological process of interest (in our case A. thaliana embryogenesis). We obtain an example gene expression set covering A. thaliana embryogenesis from the ExpressionMatrix stored in PhyloExpressionSetExample. This results in a standard DivergenceExpressionSet object.
# load the PhyloExpressionSetExample data set
data(PhyloExpressionSetExample)
# get the ExpressionMatrix covering A. thaliana embryogenesis.
ExprMatrix <- PhyloExpressionSetExample[ , 2:9]
# match the divergence map with the gene expression set of A. thaliana
# to obtain a DivergenceExpressionSet object
Ath_PhyloExpressionSet <- MatchMap( Map              = Athaliana_DM,
                                    ExpressionMatrix = ExprMatrix )
This way you can create any PhyloExpressionSet or DivergenceExpressionSet of interest. In this example, the output stored in Ath_PhyloExpressionSet should be analogous in structure to PhyloExpressionSetExample.
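Conceptually, the matching step joins the map and the expression matrix on their gene identifiers. The following base R sketch illustrates that idea on invented toy data; it is not the MatchMap() implementation, and the column name Divergencestratum as well as the gene ID at1g09999.1 are assumptions made for the example.

```r
# Toy divergence map: stratum in column 1, gene ID in column 2
toy_map <- data.frame(
  Divergencestratum = c(1, 5, 10),
  GeneID = c("AT1G01040.2", "AT1G01050.1", "AT1G01070.1"),
  stringsAsFactors = FALSE
)

# Toy expression matrix: gene ID in column 1, one column per stage
toy_expr <- data.frame(
  GeneID = c("at1g01050.1", "at1g01070.1", "at1g09999.1"),
  Zygote = c(1501.0, 1212.8, 10.0),
  Quadrant = c(1817.3, 1233.0, 12.0),
  stringsAsFactors = FALSE
)

# Normalise the case of the identifiers so that they can be compared
toy_map$GeneID <- tolower(toy_map$GeneID)
toy_expr$GeneID <- tolower(toy_expr$GeneID)

# Keep only genes present in both tables (inner join on GeneID)
toy_set <- merge(toy_map, toy_expr, by = "GeneID")

# Reorder the columns: stratum, gene ID, then the expression values
toy_set <- toy_set[, c("Divergencestratum", "GeneID", "Zygote", "Quadrant")]
toy_set
```

Genes that are missing from either table (here at1g01040.2 and at1g09999.1) drop out of the joined result.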
# load the PhyloExpressionSetExample data set
data(PhyloExpressionSetExample)
# look at PhyloExpressionSetExample
head(PhyloExpressionSetExample)
  Phylostratum      GeneID     Zygote   Quadrant   Globular      Heart    Torpedo       Bent    Mature
1            1 at1g01040.2  2173.6352  1911.2001  1152.5553  1291.4224  1000.2529   962.9772 1696.4274
2            1 at1g01050.1  1501.0141  1817.3086  1665.3089  1564.7612  1496.3207  1114.6435 1071.6555
3            1 at1g01070.1  1212.7927  1233.0023   939.2000   929.6195   864.2180   877.2060  894.8189
4            1 at1g01080.2  1016.9203   936.3837  1181.3381  1329.4734  1392.6429  1287.9746  861.2605
5            1 at1g01090.1 11424.5667 16778.1685 34366.6493 39775.6405 56231.5689 66980.3673 7772.5617
6            1 at1g01120.1   844.0414   787.5929   859.6267   931.6180   942.8453   870.2625  792.7542
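The layout shown above (a numeric stratum column, a gene ID column, then one numeric column per developmental stage) can be checked programmatically. The sketch below is a minimal base R check on a toy data frame built from the first rows and stages of the output above; the function name looks_like_expression_set() is hypothetical and only tests this column layout.

```r
# Toy data frame using the first three genes and two stages shown above
toy_pes <- data.frame(
  Phylostratum = c(1, 1, 1),
  GeneID = c("at1g01040.2", "at1g01050.1", "at1g01070.1"),
  Zygote = c(2173.6352, 1501.0141, 1212.7927),
  Quadrant = c(1911.2001, 1817.3086, 1233.0023),
  stringsAsFactors = FALSE
)

# Hypothetical check: numeric stratum column, gene ID column,
# and only numeric expression columns after that
looks_like_expression_set <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 3 &&
    is.numeric(x[[1]]) &&
    (is.character(x[[2]]) || is.factor(x[[2]])) &&
    all(vapply(x[-(1:2)], is.numeric, logical(1)))
}

looks_like_expression_set(toy_pes)         # TRUE
looks_like_expression_set(toy_pes[, 2:4])  # FALSE: stratum column is missing
```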