apex provides classes and functions for handling DNA sequences from multiple genes and individuals. Its classes extend object classes from ape and phangorn to store multi-gene data, and it supplies wrappers that mimic existing single-gene functionality for multiple genes. This document provides an overview of the package's content.
To install the development version from GitHub:
library(devtools)
install_github("thibautjombart/apex")
The stable version can be installed from CRAN using:
install.packages("apex")
Then, to load the package, use:
library("apex")
Two new classes of object extend existing data structures for multiple genes:
multidna: extends ape's DNAbin class
multiphyDat: extends phangorn's phyDat class
The multidna class is a formal (S4) class that can be seen as a multi-gene extension of ape's DNAbin class.
Data is stored as a list of DNAbin objects, with additional slots for extra information.
The class definition can be obtained by:
getClassDef("multidna")
The @dna slot contains a list of DNAbin matrices with matching rows. Any of these slots can be accessed using @ (see example below).
New multidna objects can be created in two ways: with the constructor new("multidna", ...), or by importing data from files (see the import functions further below).
We illustrate the use of the constructor below (see ?new.multidna for more information).
We use ape's dataset woodmouse, which we artificially split into two 'genes', keeping the first 500 nucleotides for the first gene and using the rest as the second gene. Note that the individuals need not match across different genes: matching is handled by the constructor.
## empty object
new("multidna")
## === multidna ===
## [ 0 DNA sequence in 0 gene ]
##
## @n.ind: 0 individual
## @n.seq: 0 sequence in total
## @labels:
## using a list of genes as input
data(woodmouse)
genes <- list(gene1=woodmouse[,1:500], gene2=woodmouse[,501:965])
x <- new("multidna", genes)
x
## === multidna ===
## [ 30 DNA sequences in 2 genes ]
##
## @n.ind: 15 individuals
## @n.seq: 30 sequences in total
## @labels: No305 No304 No306 No0906S No0908S No0909S...
##
## @dna:
## $gene1
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 500
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.326 0.230 0.147 0.297
##
## $gene2
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 465
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.286 0.295 0.103 0.316
## access the various slots
x@labels
## [1] "No305" "No304" "No306" "No0906S" "No0908S" "No0909S" "No0910S"
## [8] "No0912S" "No0913S" "No1103S" "No1007S" "No1114S" "No1202S" "No1206S"
## [15] "No1208S"
x@n.ind
## [1] 15
class(x@dna) # this is a list
## [1] "list"
names(x@dna) # names of the genes
## [1] "gene1" "gene2"
x@dna[[1]] # first gene
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 500
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.326 0.230 0.147 0.297
x@dna[[2]] # second gene
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 465
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.286 0.295 0.103 0.316
## compare the input dataset and the new multidna
par(mfrow=c(3,1), mar=c(6,6,2,1))
image(woodmouse)
image(x@dna[[1]])
image(x@dna[[2]])
## same but with missing sequences and wrong order
genes <- list(gene1=woodmouse[,1:500], gene2=woodmouse[c(5:1,14:15),501:965])
x <- new("multidna", genes)
x
## === multidna ===
## [ 22 DNA sequences in 2 genes ]
##
## @n.ind: 15 individuals
## @n.seq: 22 sequences in total
## @labels: No305 No304 No306 No0906S No0908S No0909S...
##
## @dna:
## $gene1
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 500
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.326 0.230 0.147 0.297
##
## $gene2
## 15 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 465
##
## Labels: No305 No304 No306 No0906S No0908S No0909S ...
##
## Base composition:
## a c g t
## 0.286 0.294 0.103 0.316
par(mar=c(6,6,2,1))
plot(x)
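The matching performed by the constructor can also be checked programmatically. The following is a minimal sketch, not part of the package documentation: it rebuilds the object from the example above and compares the rows stored for each gene with the @labels slot. The variable names n.rows and matched are illustrative only.

```r
library(ape)
library(apex)

data(woodmouse)

## same input as above: gene2 has fewer individuals, in a different order
genes <- list(gene1 = woodmouse[, 1:500],
              gene2 = woodmouse[c(5:1, 14:15), 501:965])
x <- new("multidna", genes)

## number of rows (sequences) stored for each gene after matching
n.rows <- sapply(x@dna, nrow)
n.rows

## check whether the row labels of each gene follow x@labels
matched <- sapply(x@dna, function(g) identical(rownames(g), x@labels))
matched
```

If the printed object above is any guide, both genes should report the same number of rows, even though only 7 sequences were supplied for the second gene.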
Two simple functions import data from multiple alignments into multidna objects:
read.multidna: reads multiple DNA alignments in various formats
read.multiFASTA: reads multiple DNA alignments in FASTA format
Both functions rely on their single-gene counterparts in ape and accept the same arguments. Each file should contain data from a single gene, and sequences should be named after individual labels only. Here is an example using a dataset from apex:
## get address of the file within apex
files <- dir(system.file(package="apex"), pattern="patr", full.names=TRUE)
files # this will change on your computer
## [1] "/home/thibaut/dev/apex/inst/patr_poat43.fasta"
## [2] "/home/thibaut/dev/apex/inst/patr_poat47.fasta"
## [3] "/home/thibaut/dev/apex/inst/patr_poat48.fasta"
## [4] "/home/thibaut/dev/apex/inst/patr_poat49.fasta"
## read these files
x <- read.multiFASTA(files)
x
## === multidna ===
## [ 24 DNA sequences in 4 genes ]
##
## @n.ind: 8 individuals
## @n.seq: 24 sequences in total
## @labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## @dna:
## $patr_poat43
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 764
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.320 0.158 0.166 0.356
##
## $patr_poat47
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 626
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.227 0.252 0.256 0.266
##
## $patr_poat48
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 560
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.305 0.185 0.182 0.327
##
## $patr_poat49
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 556
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.344 0.149 0.187 0.320
names(x@dna) # names of the genes
## [1] "patr_poat43" "patr_poat47" "patr_poat48" "patr_poat49"
par(mar=c(6,11,2,1))
plot(x)
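The file convention described above (one file per gene, sequences named after individuals) can be tried on locally written files. This is a hedged sketch under stated assumptions: the temporary directory, the file names gene1.fasta and gene2.fasta, and the object name w are hypothetical choices, and the files are written with ape's write.FASTA rather than taken from a real study.

```r
library(ape)
library(apex)

data(woodmouse)

## write each 'gene' to its own FASTA file (hypothetical file names)
out.dir <- tempfile("apex_demo")
dir.create(out.dir)
f1 <- file.path(out.dir, "gene1.fasta")
f2 <- file.path(out.dir, "gene2.fasta")
write.FASTA(woodmouse[, 1:500], f1)
write.FASTA(woodmouse[1:10, 501:965], f2)  # 5 individuals missing from gene 2

## import both files into a single multidna object
w <- read.multiFASTA(c(f1, f2))
w@n.ind       # number of individuals across both genes
names(w@dna)  # gene names, derived from the file names
```

Because sequences are matched by label, the individuals absent from the second file should still be counted in w@n.ind.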
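Additionally, read.multiphyDat imports multiple alignments into a multiphyDat object; it accepts the same arguments as the single-gene read.phyDat in phangorn:
z <- read.multiphyDat(files, format="fasta")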
z
## An object of class "multiphyDat"
## Slot "dna":
## $patr_poat43
## 5 sequences with 764 character and 8 different site patterns.
## The states are a c g t
##
## $patr_poat47
## 6 sequences with 626 character and 29 different site patterns.
## The states are a c g t
##
## $patr_poat48
## 8 sequences with 560 character and 24 different site patterns.
## The states are a c g t
##
## $patr_poat49
## 5 sequences with 556 character and 8 different site patterns.
## The states are a c g t
##
##
## Slot "labels":
## [1] "2340_50156.ab1 " "2340_50149.ab1 " "2340_50674.ab1 " "2370_45312.ab1 "
## [5] "2340_50406.ab1 " "2370_45424.ab1 " "2370_45311.ab1 " "2370_45521.ab1 "
##
## Slot "n.ind":
## [1] 8
##
## Slot "n.seq":
## [1] 24
##
## Slot "ind.info":
## NULL
##
## Slot "gene.info":
## NULL
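Several functions facilitate data handling. The example code below illustrates subsetting with [, concatenating genes with concatenate, and converting a multidna object with multidna2multiphyDat: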
files <- dir(system.file(package="apex"), pattern="patr", full.names=TRUE)
files
## [1] "/home/thibaut/dev/apex/inst/patr_poat43.fasta"
## [2] "/home/thibaut/dev/apex/inst/patr_poat47.fasta"
## [3] "/home/thibaut/dev/apex/inst/patr_poat48.fasta"
## [4] "/home/thibaut/dev/apex/inst/patr_poat49.fasta"
## read files
x <- read.multiFASTA(files)
x
## === multidna ===
## [ 24 DNA sequences in 4 genes ]
##
## @n.ind: 8 individuals
## @n.seq: 24 sequences in total
## @labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## @dna:
## $patr_poat43
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 764
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.320 0.158 0.166 0.356
##
## $patr_poat47
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 626
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.227 0.252 0.256 0.266
##
## $patr_poat48
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 560
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.305 0.185 0.182 0.327
##
## $patr_poat49
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 556
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.344 0.149 0.187 0.320
par(mar=c(6,11,2,1))
plot(x)
## subset
plot(x[1:3,2:4])
## concatenate
y <- concatenate(x)
y
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 2506
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.298 0.187 0.197 0.319
par(mar=c(5,8,2,1))
image(y)
## concatenate multiphyDat object
z <- multidna2multiphyDat(x)
u <- concatenate(z)
u
## 8 sequences with 2506 character and 69 different site patterns.
## The states are a c g t
tree <- pratchet(u, trace=0)
plot(tree, "u")
Neighbour-joining trees can be built from multidna objects, either for each gene or for pooled genes:
## make trees, default parameters
trees <- getTree(x)
trees
## 4 phylogenetic trees
plot(trees, 4, type="unrooted")
##
## Phylogenetic tree with 8 tips and 6 internal nodes.
##
## Tip labels:
## 2340_50156.ab1 , 2340_50149.ab1 , 2340_50674.ab1 , 2370_45312.ab1 , 2340_50406.ab1 , 2370_45424.ab1 , ...
##
## Unrooted; includes branch lengths.
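The example above builds one tree per gene. A pooled-gene tree can be sketched by hand with functions already used in this document plus ape's dist.dna and nj. This is an illustration under assumptions, not the package's own procedure: the distance model ("N", the raw number of differences) and the use of pairwise deletion are illustrative choices, and getTree may offer pooling directly (see ?getTree).

```r
library(ape)
library(apex)

## read the example alignments shipped with apex
files <- dir(system.file(package = "apex"), pattern = "patr", full.names = TRUE)
x <- read.multiFASTA(files)

## pool all genes into a single alignment
pooled <- concatenate(x)

## pairwise distances, ignoring sites missing in either sequence of a pair
d <- dist.dna(pooled, model = "N", pairwise.deletion = TRUE)

## neighbour-joining tree on the pooled distances
pooled.tree <- nj(d)
plot(pooled.tree, type = "unrooted")
```

Pairwise deletion is used here because individuals lacking a gene are represented by gaps in the concatenated alignment.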
Alternatively, functions from phangorn can be used to estimate trees with maximum likelihood models:
pp <- pmlPart(bf ~ edge + nni, z, control = pml.control(trace = 0))
## Warning in pml(tree, x, ...): negative edges length changed to 0!
## Warning in pml(tree, x, ...): negative edges length changed to 0!
## Warning in pml(tree, x, ...): negative edges length changed to 0!
## Warning in pml(tree, x, ...): negative edges length changed to 0!
## [1] -3510
## [1] -3510
## [1] -3510
## [1] -3510
pp
##
## loglikelihood: -3510
##
## loglikelihood of partitions:
## -1021 -933.9 -788.8 -767
## AIC: 7131 BIC: 7451
##
## Proportion of invariant sites: 0 0 0 0
##
## Rates:
## 1 1 1 1
##
## Base frequencies:
## [,1] [,2] [,3] [,4]
## [1,] 0.2989 0.1888 0.1946 0.3177
##
## Rate matrix:
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 1 1 1
trees <- pmlPart2multiPhylo(pp)
plot(trees, 4)
The following functions export data from apex to other packages. The example below uses multidna2genind and multiphyDat2genind to convert to genind objects:
## find source files in apex
files <- dir(system.file(package="apex"), pattern="patr", full.names=TRUE)
## import data
x <- read.multiFASTA(files)
x
## === multidna ===
## [ 24 DNA sequences in 4 genes ]
##
## @n.ind: 8 individuals
## @n.seq: 24 sequences in total
## @labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## @dna:
## $patr_poat43
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 764
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.320 0.158 0.166 0.356
##
## $patr_poat47
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 626
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.227 0.252 0.256 0.266
##
## $patr_poat48
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 560
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.305 0.185 0.182 0.327
##
## $patr_poat49
## 8 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 556
##
## Labels: 2340_50156.ab1 2340_50149.ab1 2340_50674.ab1 2370_45312.ab1 2340_50406.ab1 2370_45424.ab1 ...
##
## Base composition:
## a c g t
## 0.344 0.149 0.187 0.320
## export to genind
obj1 <- multidna2genind(x)
obj1
##
## #####################
## ### Genind object ###
## #####################
## - genotypes of individuals -
##
## S4 class: genind
## @call: DNAbin2genind(x = concatenate(x, genes = genes))
##
## @tab: 8 x 22 matrix of genotypes
##
## @ind.names: vector of 8 individual names
## @loc.names: vector of 11 locus names
## @loc.nall: number of alleles per locus
## @loc.fac: locus factor for the 22 columns of @tab
## @all.names: list of 11 components yielding allele names for each locus
## @ploidy: 1 1 1 1 1 1
## @type: codom
##
## Optional contents:
## @strata: - empty -
## @hierarchy: - empty -
## @pop: - empty -
## @pop.names: - empty -
##
## @other: - empty -
obj2 <- multiphyDat2genind(x)
identical(obj1, obj2)
## [1] TRUE