This vignette demonstrates how supervised learning can be applied to ecological and evolutionary inference. As an example, we take the microsatellite genotypes of 15 cattle breeds (Laloë et al. 2007) and use several supervised learning techniques to identify their population structure.
We compare six approaches that are suitable for population structure inference and visualization. The widely used unsupervised technique PCA serves as the benchmark. We then demonstrate five supervised learning approaches, DAPC, LDAKPC, LFDA, LFDAKPC, and KLFDA, all of which belong to the same discriminant analysis family.
First, we need to install and load the package.
#Install from CRAN
#install.packages("DA")
## or you can get the latest version of DA from GitHub
#library(devtools)
#install_github("xinghuq/DA")
library("DA")
#>
#> Attaching package: 'DA'
#> The following object is masked from 'package:stats':
#>
#> predict
library("kernlab")
# example data shipped with the package: allele frequencies and population labels (CSV)
f <- system.file('extdata', package = 'DA')
infile <- file.path(f, "Cattle_breeds_allele_frequency.csv")
Cattle_pop <- file.path(f, "Cattle_pop.csv")
cattle_geno <- read.csv(infile, header = TRUE)
cattle_pop <- read.csv(Cattle_pop, header = TRUE)
PCA is still one of the most commonly used approaches for studying population structure. However, the PCs represent the global structure of the data and do not use the class labels, so they take no account of the variation within classes.
cattle_pop$x=factor(cattle_pop$x,levels = unique(cattle_pop$x))
### PCA
cattle_pc=princomp(cattle_geno[,-1])
#plot the data projection on the components
library(plotly)
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.5.3
#>
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:kernlab':
#>
#> alpha
#>
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#>
#> last_plot
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following object is masked from 'package:graphics':
#>
#> layout
cols=rainbow(length(unique(cattle_pop$x)))
p0 <- plot_ly(as.data.frame(cattle_pc$scores), x =cattle_pc$scores[,1], y =cattle_pc$scores[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'))
p0
#> Warning: `arrange_()` was deprecated in dplyr 0.7.0.
#> Please use `arrange()` instead.
#> See vignette('programming') for more help
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
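To judge how much of the overall variation the first components capture, the share of variance per component can be computed from the `sdev` element returned by `princomp`. The following is a minimal sketch that uses the built-in `iris` data as a stand-in for the cattle genotypes; the object names are illustrative only.

```r
# Stand-in data (assumption): the four numeric columns of iris
pc <- princomp(iris[, 1:4])

# The variance of each component is its squared standard deviation
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_explained <- cumsum(var_explained)

# Share of variance per component and running total
print(round(rbind(var_explained, cum_explained), 3))
```

Applied to `cattle_pc`, the same two lines would show how much of the total variance the PC1 and PC2 axes of the plot account for.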
DAPC is commonly used to display population structure in population genetics. It is available in the “adegenet” package (Jombart 2008).
library(adegenet)
#> Warning: package 'adegenet' was built under R version 3.5.3
#> Loading required package: ade4
#>
#> /// adegenet 2.1.2 is loaded ////////////
#>
#> > overview: '?adegenet'
#> > tutorials/doc/questions: 'adegenetWeb()'
#> > bug reports/feature requests: adegenetIssues()
cattle_pop$x=factor(cattle_pop$x,levels = unique(cattle_pop$x))
###DAPC
cattle_dapc=dapc(cattle_geno[,-1],grp=cattle_pop$x,n.pca=10, n.da=3)
#plot the data projection on the components
library(plotly)
cols=rainbow(length(unique(cattle_pop$x)))
p1 <- plot_ly(as.data.frame(cattle_dapc$ind.coord), x =cattle_dapc$ind.coord[,1], y =cattle_dapc$ind.coord[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'LDA1'),
yaxis = list(title = 'LDA2'))
p1
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
Fig. 2 DAPC plot of 15 cattle breeds. This is an interactive plot: hover over a point to display its values.
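Conceptually, DAPC first reduces the data with PCA and then runs a linear discriminant analysis on the retained components. The sketch below reproduces that two-step idea with `prcomp` and `MASS::lda` on the built-in `iris` data. It is an illustration under these assumptions, not the code that `dapc` runs, and the number of retained components (`n_pca`) is an arbitrary choice.

```r
library(MASS)  # provides lda()

X <- as.matrix(iris[, 1:4])   # stand-in for the genotype matrix (assumption)
grp <- iris$Species           # stand-in for the breed labels (assumption)

# Step 1: PCA, keeping the first n_pca components
n_pca <- 3
pca <- prcomp(X, center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:n_pca, drop = FALSE]

# Step 2: linear discriminant analysis on the retained components
fit <- lda(pcs, grouping = grp)
pred <- predict(fit, pcs)
ind_coord <- pred$x           # individual coordinates on the discriminant axes

dim(ind_coord)
```

Keeping too many components risks overfitting the discriminant step, while keeping too few can discard group differences, so `n_pca` is worth varying when exploring real data.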
Linear discriminant analysis of kernel principal components (LDAKPC) is a variant of DAPC. To capture non-linear relationships between loci and samples, it replaces the PCA step with kernel principal component analysis. Below is the implementation of LDAKPC.
cattle_ldakpc=LDAKPC(cattle_geno[,-1],cattle_pop$x,n.pc=3)
cols=rainbow(length(unique(cattle_pop$x)))
p2 <- plot_ly(as.data.frame(cattle_ldakpc$LDs), x =cattle_ldakpc$LDs[,1], y =cattle_ldakpc$LDs[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'LDA1'),
yaxis = list(title = 'LDA2'))
p2
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
LDAKPC gives a result similar to that of DAPC.
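The idea of running a discriminant analysis on kernel principal components can be sketched with base R and `MASS::lda`. The example below builds a Gaussian (RBF) kernel matrix by hand, centers it, takes the leading eigenvectors as kernel principal component scores, and passes them to LDA. The `iris` data, the kernel width `sigma`, and the number of components `n_pc` are assumptions chosen for illustration; the `LDAKPC` function may differ in kernel, scaling, and defaults.

```r
library(MASS)  # provides lda()

X <- scale(as.matrix(iris[, 1:4]))   # stand-in data (assumption)
y <- iris$Species

# Gaussian (RBF) kernel: k(x, z) = exp(-sigma * ||x - z||^2)
sigma <- 0.1
K <- exp(-sigma * as.matrix(dist(X))^2)

# Center the kernel matrix in feature space
n <- nrow(K)
H <- diag(n) - matrix(1 / n, n, n)
Kc <- H %*% K %*% H

# Kernel PCA scores: leading eigenvectors scaled by sqrt(eigenvalue)
n_pc <- 3
eig <- eigen(Kc, symmetric = TRUE)
kpcs <- sweep(eig$vectors[, 1:n_pc, drop = FALSE], 2,
              sqrt(eig$values[1:n_pc]), "*")

# LDA on the kernel principal components
fit <- lda(kpcs, grouping = y)
lds <- predict(fit, kpcs)$x

dim(lds)
```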
In comparison to LDA, local Fisher discriminant analysis (LFDA) also takes the local structure of the samples within each class into account (Sugiyama 2007). Thus, LFDA can discriminate multimodal data, where a class consists of several separate clusters, while LDA cannot. In this sense LFDA is an extension of LDA.
cattle_lfda=LFDA(cattle_geno[,-1],cattle_pop$x,r=3,tol=1E-3)
cols=rainbow(length(unique(cattle_pop$x)))
p3 <- plot_ly(as.data.frame(cattle_lfda$Z), x =cattle_lfda$Z[,1], y =cattle_lfda$Z[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'LDA1'),
yaxis = list(title = 'LDA2'))
p3
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
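To make the role of locality concrete, here is a simplified sketch of LFDA following the description in Sugiyama (2007): pairs of samples from the same class are weighted by a locally scaled affinity, so distant samples of one class are not forced together. The helper `lfda_sketch`, its arguments, and the `iris` stand-in data are hypothetical and written for illustration only; this is not the code behind the package's `LFDA` function, and it assumes every class has at least two samples.

```r
# Hypothetical helper, for illustration only
lfda_sketch <- function(X, y, r = 2, knn = 7) {
  X <- as.matrix(X)
  y <- as.factor(y)
  n <- nrow(X)
  Ww <- matrix(0, n, n)       # within-class weights: zero across classes
  Wb <- matrix(1 / n, n, n)   # between-class weights: 1/n across classes
  for (cl in levels(y)) {
    idx <- which(y == cl)
    nc <- length(idx)
    D2 <- as.matrix(dist(X[idx, , drop = FALSE]))^2
    # Local scaling: distance to the k-th nearest neighbour within the class
    k <- min(knn, nc - 1)
    sig <- apply(D2, 1, function(d) sqrt(sort(d)[k + 1]))
    A <- exp(-D2 / outer(sig, sig))          # affinity between same-class samples
    Ww[idx, idx] <- A / nc
    Wb[idx, idx] <- A * (1 / n - 1 / nc)
  }
  # 1/2 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T equals t(X) %*% (D - W) %*% X
  scatter <- function(W) t(X) %*% (diag(rowSums(W)) - W) %*% X
  Sw <- scatter(Ww)
  Sb <- scatter(Wb)
  # Solve Sb v = lambda Sw v through a Cholesky factor of Sw (small ridge added)
  R <- chol(Sw + 1e-8 * diag(ncol(X)))
  Rinv <- solve(R)
  M <- t(Rinv) %*% Sb %*% Rinv
  eig <- eigen((M + t(M)) / 2, symmetric = TRUE)
  Tmat <- Rinv %*% eig$vectors[, seq_len(r), drop = FALSE]
  list(T = Tmat, Z = X %*% Tmat)   # transformation matrix and projected data
}

res <- lfda_sketch(iris[, 1:4], iris$Species, r = 2)
head(res$Z)
```

If the affinities are all set to 1, the two scatter matrices reduce to the ordinary within-class and between-class scatter of LDA, which is one way to see LFDA as a localized extension of it.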
As LFDA is more advanced than LDA, we replace the LDA step of LDAKPC with LFDA. The result is local (Fisher) discriminant analysis of kernel principal components (LFDAKPC). Below is the implementation of LFDAKPC.
cattle_lfdakpc=LFDAKPC(cattle_geno[,-1],cattle_pop$x,n.pc=3,tol=1E-3)
cols=rainbow(length(unique(cattle_pop$x)))
p4 <- plot_ly(as.data.frame(cattle_lfdakpc$LDs), x =cattle_lfdakpc$LDs[,1], y =cattle_lfdakpc$LDs[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'LDA1'),
yaxis = list(title = 'LDA2'))
p4
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
Fig.5 LFDAKPC plot of 15 cattle breeds.
LFDAKPC also produces results similar to those of LDAKPC and DAPC.
Kernel local (Fisher) discriminant analysis (KLFDA) is a kernelized version of local Fisher discriminant analysis (LFDA). KLFDA can capture non-linear relationships between samples. Its discriminatory power has been reported to be significantly higher than that of LDA.
cattle_klfda=KLFDA(as.matrix(cattle_geno[,-1]),as.factor(cattle_pop$x),r=3,tol=1E-10,prior = NULL)
cols=rainbow(length(unique(cattle_pop$x)))
p5 <- plot_ly(as.data.frame(cattle_klfda$Z), x =cattle_klfda$Z[,1], y =cattle_klfda$Z[,2], color = cattle_pop$x,colors=cols[cattle_pop$x],symbol = cattle_pop$x,symbols = 1:15L) %>%
add_markers() %>%
layout(xaxis = list(title = 'LDA1'),
yaxis = list(title = 'LDA2'))
p5
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 15. Consider
#> specifying shapes manually if you must have them.
KLFDA appears to produce tighter clusters than the methods above.
All the above methods show the same global structure for the 15 cattle breeds.
KLFDA performed best for population structure inference when tested on this cattle data set. We now plot the individual membership, that is, each individual's posterior probabilities of belonging to each population, to represent the population structure. The result resembles the plots produced by the STRUCTURE software.
library(adegenet)
## assignment plot
compoplot(as.matrix(cattle_klfda$bayes_assigment$posterior),show.lab = TRUE, posi=list(x=5,y=-0.01),txt.leg = unique(cattle_pop$x))
Fig. 7 The population structure of Cattle breeds (individual assignment)
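A membership plot of this kind only needs a matrix of posterior probabilities with one row per individual and one column per group. As a minimal sketch, the example below obtains such a matrix from `MASS::lda` on the built-in `iris` data (a stand-in classifier, not the KLFDA assignment used above) and draws it as a stacked bar plot with base graphics.

```r
library(MASS)  # provides lda()

fit <- lda(Species ~ ., data = iris)    # stand-in classifier (assumption)
post <- predict(fit, iris)$posterior    # rows: individuals, columns: groups

# Each row is a probability distribution over the groups
stopifnot(all(abs(rowSums(post) - 1) < 1e-8))

# STRUCTURE-style stacked bar plot: one bar per individual
barplot(t(post), col = rainbow(ncol(post)), border = NA, space = 0,
        xlab = "Individual", ylab = "Membership probability",
        names.arg = rep("", nrow(post)),
        legend.text = colnames(post))
```

Individuals whose bar is dominated by one colour are assigned confidently to that group, while mixed bars indicate uncertain assignment.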
More tutorials can be found at URL: https://xinghuq.github.io/DA/articles/index.html.
Laloë, D., Jombart, T., Dufour, A.-B. & Moazami-Goudarzi, K. (2007). Consensus genetic structuring and typological value of markers using multiple co-inertia analysis. Genetics Selection Evolution, 39, 545.
Jombart, T. (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24, 1403-1405.
Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 1027-1061.
Sugiyama, M. (2006). Local Fisher discriminant analysis for supervised dimensionality reduction. In W. W. Cohen and A. Moore (Eds.), Proceedings of 23rd International Conference on Machine Learning (ICML2006), 905-912.
Tang, Y., & Li, W. (2019). lfda: Local Fisher Discriminant Analysis in R. Journal of Open Source Software, 4(39), 1572.
Moore, A. W. (2004). Naive Bayes Classifiers. In School of Computer Science. Carnegie Mellon University.
Enel, P. (2020). Kernel Fisher Discriminant Analysis. GitHub, https://github.com/p-enel/MatlabKFDA. Retrieved March 30, 2020.
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). kernlab-an S4 package for kernel methods in R. Journal of statistical software, 11(9), 1-20.
Wu, B. (2012). WMDB 1.0: Discriminant Analysis Methods by Weight Mahalanobis Distance and bayes.
Ito, Y., Srinivasan, C., & Izumi, H. (2006, September). Discriminant analysis by a neural network with Mahalanobis distance. In International Conference on Artificial Neural Networks (pp. 350-360). Springer, Berlin, Heidelberg.
Wölfel, M., & Ekenel, H. K. (2005, September). Feature weighted Mahalanobis distance: improved robustness for Gaussian classifiers. In 2005 13th European signal processing conference (pp. 1-4). IEEE.
Qin, X., Wu, M., Lock, R., & Kallenbach, R. (2020). DA: Ecological and Evolutionary Inference Using Supervised Discriminant Analysis. Authorea. DOI: 10.22541/au.159256808.83862168