kmer: an R package for fast alignment-free clustering of biological sequences

Shaun Wilkinson

2018-03-06


Abstract

Summary: The kmer package contains functions for rapidly computing distance matrices and clustering biological sequence datasets in R, using fast alignment-free k-mer counting. This package has a range of applications including OTU clustering, classification trees, and phylogenetic analysis of very large biological sequence datasets.
Availability and Implementation: The kmer package is released underthe GPL-3 license, and is available for download from CRAN https://CRAN.R-project.org/package=kmer and github https://github.com/shaunpwilkinson/kmer.
Contact: shaunpwilkinson@gmail.com

Introduction

As biological sequence datasets continue to increase in size due to high-throughput sequencing, traditional clustering methods based on multiple sequence alignments, pairwise distance matrices, and agglomerative (bottom-up) tree-building methods can become computationally infeasible. Alternative k-mer counting methods involve enumerating all k-letter words in a sequence through a sliding window of length k. The \(n \times 4^k\) matrix of k-mer counts (where \(n\) is the number of sequences) can then be used in place of a multiple sequence alignment to calculate distances and/or build a phylogenetic tree. Here, we introduce kmer, an R package for clustering large sequence datasets using fast alignment-free k-mer counting. This can be achieved with or without a multiple sequence alignment, and with or without a matrix of pairwise distances. These functions are detailed below with examples of their utility.

Distance matrix computation

The function kcount is used to count k-mers within a sequence or set of sequences, by sliding a window of length k along each sequence and counting the number of times each k-mer appears (for example, the \(4^3 = 64\) possible DNA 3-mers: AAA, AAC, AAG, …, TTT). The kdistance function computes a fast, alignment-free distance matrix, using k-mer counts to derive the pairwise distances. The default distance metric used by kdistance is the k-mer (k-tuple) distance measure outlined in Edgar (2004). For two DNA sequences \(a\) and \(b\), the fractional common k-mer count over the \(4^k\) possible words of length \(k\) is calculated as: \[F = \sum\limits_{\tau}\frac{min (n_a(\tau), n_b (\tau))}{min (L_a , L_b ) - k + 1} \tag{1}\]

where \(\tau\) represents each possible k-mer, \(n_a(\tau)\) and \(n_b(\tau)\) are the number of times \(\tau\) appears in each sequence, \(k\) is the k-mer length and \(L\) is the sequence length. The pairwise distance between \(a\) and \(b\) is then calculated as:

\[d = \frac{log(0.1 + F) - log(1.1)}{log(0.1)} \tag{2}\]

For \(n\) sequences, the kdistance operation has time and memory complexity \(O(n^2)\) and thus can become computationally infeasible when the sequence set is large (i.e. more than ~ 10,000 sequences). As such, the kmer package also offers the function mbed, that only computes the distances from each sequence to a smaller (or equal) sized subset of ‘seed’ sequences (Blackshields et al., 2010). The default behavior of the kdistance function is to select \(t = (log_2n)^2\) seeds by clustering the sequences (k-means algorithm with \(k = t\)), and selecting one representative sequence from each cluster.

DNA and amino acid sequences can be passed to kcount, kdistance and mbed either as a list of non-aligned sequences or a matrix of aligned sequences, preferably in either the “DNAbin” or “AAbin” raw-byte format (see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguities are counted according to their underlying residue frequencies (e.g. the 5-mer “ACRGT” would contribute 0.5 to the tally for “ACAGT” and 0.5 to that of “ACGGT”). This excludes the ambiguity code “N”, which is ignored.

Example 1: Compute k-mer distance matrices for the woodmouse dataset

The R package ape (Paradis et al., 2004) contains a dataset of 15 aligned mitochondrial cytochrome b gene DNA sequences from the woodmouse Apodemus sylvaticus, originally published in Michaux et al. (2003). While the kmer distance functions do not require sequences to be aligned, this example will enable us to compare k-mer distances with the traditional alignment-dependent distances produced by ape::dist.dna. First, load the dataset as follows:

library(ape)
data(woodmouse)
## view the first few rows and columns 
as.character.DNAbin(woodmouse[1:5, 1:5])
#>         [,1] [,2] [,3] [,4] [,5]
#> No305   "n"  "t"  "t"  "c"  "g" 
#> No304   "a"  "t"  "t"  "c"  "g" 
#> No306   "a"  "t"  "t"  "c"  "g" 
#> No0906S "a"  "t"  "t"  "c"  "g" 
#> No0908S "a"  "t"  "t"  "c"  "g"

This is a semi-global (‘glocal’) alignment featuring some incomplete sequences, with unknown characters represented by the ambiguity code “n” (e.g. No305). To avoid artificially inflating the distances between these partial sequences and the others, we first trim the gappy ends by subsetting the global alignment (note that the ape function dist.dna also removes columns with ambiguity codes prior to distance computation by default).

woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]

The following code first computes the full \(n \times n\) distance matrix, and then the embedded distances of each sequence to three randomly selected seed sequences. In both cases the k-mer size is set to 5.

### Compute the full distance matrix and print the first few rows and columns
library(kmer)
woodmouse.kdist <- kdistance(woodmouse, k = 5)
print(as.matrix(woodmouse.kdist)[1:7, 1:7], digits = 2)
#>         No305  No304  No306 No0906S No0908S No0909S No0910S
#> No305   0.000 0.0241 0.0215   0.026   0.027   0.028   0.029
#> No304   0.024 0.0000 0.0042   0.017   0.018   0.025   0.019
#> No306   0.021 0.0042 0.0000   0.014   0.014   0.020   0.014
#> No0906S 0.026 0.0171 0.0136   0.000   0.019   0.025   0.011
#> No0908S 0.027 0.0184 0.0136   0.019   0.000   0.024   0.019
#> No0909S 0.028 0.0246 0.0197   0.025   0.024   0.000   0.026
#> No0910S 0.029 0.0193 0.0145   0.011   0.019   0.026   0.000

### Compute and print the embedded distance matrix
set.seed(999)
seeds <- sample(1:15, size = 3)
woodmouse.mbed <- mbed(woodmouse, seeds = seeds, k = 5)
### remove the attributes for printing by subsetting the distance matrix
print(woodmouse.mbed[,], digits = 2)
#>         No0909S No0913S  No304
#> No305    0.0277  0.0295 0.0241
#> No304    0.0246  0.0085 0.0000
#> No306    0.0197  0.0080 0.0042
#> No0906S  0.0255  0.0175 0.0171
#> No0908S  0.0241  0.0215 0.0184
#> No0909S  0.0000  0.0282 0.0246
#> No0910S  0.0259  0.0145 0.0193
#> No0912S  0.0167  0.0241 0.0202
#> No0913S  0.0282  0.0000 0.0085
#> No1103S  0.0123  0.0193 0.0154
#> No1007S  0.0038  0.0286 0.0250
#> No1114S  0.0340  0.0345 0.0295
#> No1202S  0.0264  0.0145 0.0193
#> No1206S  0.0219  0.0197 0.0158
#> No1208S  0.0038  0.0309 0.0273

Clustering

To avoid excessive time and memory use when building large trees (e.g. n > 10,000), the kmer package features the function cluster for divisive tree building, free of both alignment and distance matrix computation. This function first generates a matrix of k-mer counts, and then recursively partitions the matrix row-wise using successive k-means clustering (k = 2). While this method may not necessarily reconstruct sufficiently accurate phylogenetic trees for taxonomic purposes, it offers a fast and efficient means of producing large trees for a variety of other applications. These include tree-based sequence weighting (e.g. Gerstein et al. (1994)), guide trees for progressive multiple sequence alignment (e.g. Sievers et al. (2011)), and other recursive operations such as classification tree learning.

Example 2: Top-down clustering of the woodmouse data

In this example we will compare the fast top-down method with standard neighbor-joining trees constructed from full distance matrices computed using both the alignment-free k-mer distance and the Kimura (1980) distance as featured in the ape package examples.

## set out plotting panes
par(mfrow = c(1, 3), mar = c(1, 2, 3, 3), cex = 0.3)

## (1) neighbor joining tree with Kimura 1980 distance
### compute the n x n K80 distance matrix 
woodmouse.dist <- ape::dist.dna(woodmouse, model = "K80") 
### build the neighbor-joining tree
tree1.phylo <- ape::nj(woodmouse.dist)
### export as Newick text
tree1.newick <- ape::write.tree(tree1.phylo)
### import as "dendrogram" object
tree1.dendro <- phylogram::read.dendrogram(text = tree1.newick)
### sort nodes by size
tree1.dendro <- phylogram::ladder(tree1.dendro)
### plot the nj tree
plot(tree1.dendro, horiz = TRUE, yaxt = "n", 
     main = "Neighbor-joining tree with\nK80 distance matrix")

## (2) neighbor joining tree with k-mer distance
### compute the n x n k-mer distance matrix 
woodmouse.kdist <- kdistance(woodmouse, k = 5) 
### build the neighbor-joining tree
tree2.phylo <- ape::nj(woodmouse.kdist)
### export as Newick text
tree2.newick <- ape::write.tree(tree2.phylo)
### import as "dendrogram" object
tree2.dendro <- phylogram::read.dendrogram(text = tree2.newick)
### sort nodes by size
tree2.dendro <- phylogram::ladder(tree2.dendro)
### plot the nj tree
plot(tree2.dendro, horiz = TRUE, yaxt = "n", 
     main = "Neighbor-joining tree with\nk-mer distance matrix (k=5)")

## (3) topdown tree without distance matrix
set.seed(999)
tree3 <- cluster(woodmouse, k = 5, nstart = 20)
## sort nodes by size
tree3 <- phylogram::ladder(tree3)
### plot the topdown tree
plot(tree3, horiz = TRUE, yaxt = "n", 
     main = "Top-down tree without\ndistance matrix (k=5)")

Figure 1: Comparison of clustering methods for woodmouse sequences The top-down and neighbor-joining trees show relatively congruent fine-scale topologies, regardless of the distance measure and method used. However, for large sequence sets, the top-down (divisive) method builds trees orders of magnitude faster than traditional alignment and distance matrix-based methods, since computation time and memory use increasing linearly rather than quadratically with n.

This package also features the function otu for rapid clustering of sequences into operational taxonomic units based on a genetic distance (k-mer distance) threshold. This function performs a similar operation to cluster in that it recursively partitions a k-mer count matrix to assign sequences to groups. The top-down splitting continues until the farthest k-mer distance in every cluster is higher than a given threshold value (0.97 by default). Rather than returning a dendrogram, otu returns a named integer vector of cluster membership, with asterisks indicating the representative sequences within each cluster.

Example 3: OTU clustering of the woodmouse data

In this final example, the woodmouse dataset is clustered into operational taxonomic units with a threshold of 0.97 (k-mer distance with k = 5) and with 20 random starts per k-means split (recommended for improved accuracy).

set.seed(999)
woodmouse.OTUs <- otu(woodmouse, k = 5, threshold = 0.97, nstart = 20)
woodmouse.OTUs
#>   No305*    No304   No306*  No0906S  No0908S No0909S*  No0910S  No0912S 
#>        3        1        1        1        1        2        1        2 
#>  No0913S  No1103S  No1007S  No1114S  No1202S  No1206S  No1208S 
#>        1        2        2        3        1        1        2

The function outputs a named integer vector of OTU membership, with asterisks indicating the representative sequence from each cluster (i.e. the most “central” sequence). In this case, three distinct OTUs were found, with No305 and N01114S forming one cluster (3), No0909S, No0912S, No1103S, No1007S and No1208S forming another (2) and the remainder falling into cluster 1 in concordance with the topology of Figure 1c.

Concluding remarks

The kmer package is released underthe GPL-3 license, and is free to distribute under certain conditions; however it comes with no warranty. Please direct bug reports to the GitHub issues page at http://github.com/shaunpwilkinson/kmer/issues. Any feedback is greatly appreciated.

Acknowledgements

This software was developed with funding from a Rutherford Foundation Postdoctoral Research Fellowship from the Royal Society of New Zealand.

References

Blackshields,G. et al. (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology, 5, 21.

Edgar,R.C. (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380–385.

Gerstein,M. et al. (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067–1078.

Kimura,M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.

Michaux,J.R. et al. (2003) Mitochondrial phylogeography of the woodmouse (Apodemus sylvaticus) in the Western Palearctic region. Molecular Ecology, 12, 685–697.

Paradis,E. et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.

Sievers,F. et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7, 539.