Guest post from Rachid Ounit on CLARK: Fast and Accurate Classification of Metagenomic and Genomic Sequences

Recently I received and email from Rachid Ounit pointing me to a new open access paper he had on a metagenomics analysis tool called CLARK.  I asked him if he would be willing to write a guest post about it and, well, he did.  Here is it:

CLARK: Accurate metagenomic analysis of a million reads in 20 seconds or less…
At the University of California, Riverside, we have developed a new lightweight algorithm to classify accurately metagenomic samples while minimizing computational resources better than any other classifiers (e.g., Kraken).  While CLARK and Kraken have comparable accuracy, CLARK is significantly faster (cf. Fig. a) and uses less RAM and disk space (cf. Fig. b-c). In default mode and single-threaded, CLARK’s classification speed is higher than 3 million short reads per minute (cf. Fig. a), and it also scales better in multithreading (cf. Fig. d). Like Kraken, CLARK uses k-mers (short DNA words of length k) to solve the classification problem. However, while Kraken and other k-mers based classifiers consider the whole taxonomy tree and must resolve k-mers that match genomes from different taxa (by using the concept of “lowest common ancestor” from MEGAN), CLARK rather considers taxa defined for a unique taxonomy rank (e.g. species/genus), and, during the preprocessing, discards any k-mers that can be found in any pair of taxon. In other words, CLARK exploits specificities of each taxon (against all others) to populate its light and efficient data structure. It uses a customized dictionary of k-mers, in which each k-mer is associated to at most one taxon and results in fast k-mer queries. Then, the read is assigned to the taxon that has the highest amount of k-mers matches with it. Since these matches are discriminative, CLARK assignments are highly accurate. We also show that the choice of the value of k is critical for the optimal performance, and long k-mers (e.g., 31-mers) are not necessarily the best choice to perform accurate identification.  For example, high confidence assignments using 20-mers from real metagenomes show strong consistency with several published and independent results. 
Finally, CLARK can be used for detecting contamination in draft reference genome or, in genomics, chimera in sequenced BACs. We are currently investigating new techniques for improving the sensitivity and the speed of the tool, and we plan to release a new version later this year. We are also extending the tool for comparative genomics/metagenomics purposes. A “RAM-light” version of CLARK for your 4 GB RAM laptop is also available. CLARK is user-friendly (i.e., easy to use, it does not require strong background in programming/bioinformatics) and self-contained (i.e., does not need depend on any external software tool). The latest version of CLARK (v1.1.2) contains several features to analyze your results and is freely available under the GNU GPL license (for more details, please visit CLARK’s webpage). Experimental results and algorithm details can be found in the BMC genomics manuscript.
Performance of Kraken (v0.10.4-beta) and CLARK (v1.0) for the classification of a metagenome sample of 10,000 reads (average reads length 92bp).  a) The classification speed (in 103 reads per minute) in default mode. b) RAM usage (in GB) for the classification. c) Disk space (in GB) required for the database (bacterial genomes from NCBI/RefSeq). d) Classification speed (in 10^3 reads per minute) using 1, 2, 4 and 8 threads.

Useful comparative analysis of sequence classification systems w/ a few questionable bits

There is a useful new publication just out: BMC Bioinformatics | Abstract | A comparative evaluation of sequence classification programs by Adam L Bazinet and Michael P Cummings.  In the paper the authors attempt to do a systematic comparison of tools for classifying DNA sequences according to the taxonomy of the organism from which they come.

I have been interested in such activities since, well, since 1989 when I started working in Colleen Cavanaugh’s lab at Harvard sequencing rRNA genes to do classification.  And I have known one of the authors, Michael Cummings for almost as long.

Their abstract does a decent job of summing up what they did

A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for ’barcoding genes’ like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis. 

We divided the very large number of programs that have been released in recent years for solving the sequence classification problem into three main categories based on the general algorithm they use to compare a query sequence against a database of sequences. We also evaluated the performance of the leading programs in each category on data sets whose taxonomic and functional composition is known. 

We found significant variability in classification accuracy, precision, and resource consumption of sequence classification programs when used to analyze various metagenomics data sets. However, we observe some general trends and patterns that will be useful to researchers who use sequence classification programs.

The three main categories of methods they identified are

  • Programs that primarily utilize sequence similarity search
  • Programs that primarily utilize sequence composition models (like CompostBin from my lab)
  • Programs that primarily utilize phylogenetic methods (like AMPHORA & STAP from my lab)
The paper has some detailed discussion and comparison of some of the methods in each category.  They even made a tree of the methods
Figure 1. Program clustering. A neighbor-joining tree
 that clusters the classification programs based on their similar attributes. From here.
In some ways – I love this figure.  Since, well, I love trees.  But in other ways I really really really do not like it.  I don’t like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees and not to simply cluster entities by their similarity) to cluster entities that do not have a phylogenetic history.  Why use neighbor-joining here?  What is the basis for using this method to cluster methods?  It is cute, sure.  But I don’t get it.  What do deep branches represent in this case?  It drives me a bit crazy when people throw a method designed to represent branching history at a situation where clustering by similarity is needed.  Similarly it drives me crazy when similarity based clustering methods are used when history is needed.
Not to take away from the paper too much since this is definitely worth a read for those working on methods to classify sequences as well as for those using such methods.  They even go so far as to test various web served (e.g., MGRAST) and discuss time to get results.  They also test the methods for their precision and sensitivity.  Very useful bits of information here.
So – overall I like the paper.  But one other thing in here sits in my craw in the wrong way.  The discussion of “marker genes.”  Below is some of the introductory text on the topic.  I have labelled some bits I do not like too much:

It is important to note that some supervised learning methods will only classify sequences that contain “marker genes”. Marker genes are ideally present in all organisms, and have a relatively high mutation rate that produces significant variation between species. The use of marker genes to classify organisms is commonly known as DNA barcoding. The 16S rRNA gene has been used to greatest effect for this purpose in the microbial world (green genes [6], RDP [7]). For animals, the mitochondrial COI gene is popular [8], and for plants the chloroplast genes rbcL and matK have been used [9]. Other strategies have been proposed, such as the use of protein-coding genes that are universal, occur only once per genome (as opposed to 16S rRNA genes that can vary in copy number), and are rarely horizontally transferred [10]. Marker gene databases and their constitutive multiple alignments and phylogenies are usually carefully curated, so taxonomic and functional assignments based on marker genes are likely to show gains in both accuracy and speed over methods that analyze input sequences less discriminately. However, if the sequencing was not specially targeted [11], reads that contain marker genes may only account for a small percentage of a metagenomic sample.  

I think I will just leave these highlighted sections uncommented upon and leave it to people to imagine what I don’t like about them .. for now.

Anyway – again – the paper is worth checking out.  And if you want to know more about methods used for classifying sequences see this Mendeley collection which focuses on metagenomic analysis but has many additional paper on top of the ones discussed in this paper.