Eisen Lab Blog

Useful comparative analysis of sequence classification systems w/ a few questionable bits

There is a useful new publication just out: BMC Bioinformatics | Abstract | A comparative evaluation of sequence classification programs by Adam L Bazinet and Michael P Cummings.  In the paper the authors attempt to do a systematic comparison of tools for classifying DNA sequences according to the taxonomy of the organism from which they come.

I have been interested in such activities since, well, since 1989 when I started working in Colleen Cavanaugh’s lab at Harvard sequencing rRNA genes to do classification.  And I have known one of the authors, Michael Cummings, for almost as long.

Their abstract does a decent job of summing up what they did:

Background
A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for ’barcoding genes’ like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis. 

Results
We divided the very large number of programs that have been released in recent years for solving the sequence classification problem into three main categories based on the general algorithm they use to compare a query sequence against a database of sequences. We also evaluated the performance of the leading programs in each category on data sets whose taxonomic and functional composition is known. 

Conclusions
We found significant variability in classification accuracy, precision, and resource consumption of sequence classification programs when used to analyze various metagenomics data sets. However, we observe some general trends and patterns that will be useful to researchers who use sequence classification programs.

The three main categories of methods they identified are

  • Programs that primarily utilize sequence similarity search
  • Programs that primarily utilize sequence composition models (like CompostBin from my lab)
  • Programs that primarily utilize phylogenetic methods (like AMPHORA & STAP from my lab)
The paper has some detailed discussion and comparison of some of the methods in each category.  They even made a tree of the methods
Figure 1. Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes. From here.
In some ways – I love this figure.  Since, well, I love trees.  But in other ways I really really really do not like it.  I don’t like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees and not to simply cluster entities by their similarity) to cluster entities that do not have a phylogenetic history.  Why use neighbor-joining here?  What is the basis for using this method to cluster methods?  It is cute, sure.  But I don’t get it.  What do deep branches represent in this case?  It drives me a bit crazy when people throw a method designed to represent branching history at a situation where clustering by similarity is needed.  Similarly it drives me crazy when similarity based clustering methods are used when history is needed.
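To make the distinction concrete, here is a minimal sketch of what plain similarity-based clustering of programs could look like: Jaccard distance between attribute sets followed by average-linkage (UPGMA-style) agglomeration. The program names and attributes below are hypothetical, made up purely for illustration and not taken from the paper's figure, and this is not the procedure the authors used.

```python
from itertools import combinations

# Hypothetical attribute sets (illustration only, not from the paper).
ATTRIBUTES = {
    "prog_a": {"similarity_search", "web_server"},
    "prog_b": {"similarity_search", "web_server", "marker_genes"},
    "prog_c": {"composition", "unsupervised"},
    "prog_d": {"phylogenetic", "marker_genes"},
}


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; 0.0 means identical attribute sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(items):
    """Agglomerative clustering with average linkage.

    Returns the merges in order as (cluster_1, cluster_2, distance),
    where each cluster is a tuple of item names.
    """
    def dist(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(items[x], items[y]) for x, y in pairs)
        return total / len(pairs)

    clusters = [(name,) for name in sorted(items)]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


if __name__ == "__main__":
    for left, right, d in average_linkage(ATTRIBUTES):
        print(f"merge {left} + {right} at distance {d:.3f}")
```

The merge heights here are just average dissimilarities. Nothing in the output claims to be a branching history, which is the point.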
Not to take away from the paper too much, since this is definitely worth a read for those working on methods to classify sequences as well as for those using such methods.  They even go so far as to test various web servers (e.g., MGRAST) and discuss the time needed to get results.  They also test the methods for their precision and sensitivity.  Very useful bits of information here.
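For readers who have not worked with these two measures, here is a small sketch of how they might be computed for a set of classified reads. It assumes one common convention in this kind of benchmark (precision = correct assignments / assignments made; sensitivity = correct assignments / all reads), which may differ in detail from the paper's definitions. The function name and the example labels are hypothetical.

```python
def precision_and_sensitivity(true_labels, predicted_labels):
    """Return (precision, sensitivity) for taxonomic assignments.

    A prediction of None means the program left the read unclassified.
    Assumed convention (may differ from the paper):
      precision   = correct assignments / assignments made
      sensitivity = correct assignments / total number of reads
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one read")

    assigned = 0
    correct = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted is None:
            continue  # unclassified reads count against sensitivity only
        assigned += 1
        if predicted == truth:
            correct += 1

    precision = correct / assigned if assigned else 0.0
    sensitivity = correct / len(true_labels)
    return precision, sensitivity


if __name__ == "__main__":
    # Made-up example: five reads, one left unclassified, one misassigned.
    truth = ["Proteobacteria", "Firmicutes", "Firmicutes",
             "Actinobacteria", "Proteobacteria"]
    calls = ["Proteobacteria", "Firmicutes", None,
             "Firmicutes", "Proteobacteria"]
    print(precision_and_sensitivity(truth, calls))
```

A program that classifies very few reads but gets them all right will score high on precision and low on sensitivity, which is why it helps to report both.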
So – overall I like the paper.  But one other thing in here sticks in my craw.  The discussion of “marker genes.”  Below is some of the introductory text on the topic.  I have labelled some bits I do not like too much:

It is important to note that some supervised learning methods will only classify sequences that contain “marker genes”. Marker genes are ideally present in all organisms, and have a relatively high mutation rate that produces significant variation between species. The use of marker genes to classify organisms is commonly known as DNA barcoding. The 16S rRNA gene has been used to greatest effect for this purpose in the microbial world (green genes [6], RDP [7]). For animals, the mitochondrial COI gene is popular [8], and for plants the chloroplast genes rbcL and matK have been used [9]. Other strategies have been proposed, such as the use of protein-coding genes that are universal, occur only once per genome (as opposed to 16S rRNA genes that can vary in copy number), and are rarely horizontally transferred [10]. Marker gene databases and their constitutive multiple alignments and phylogenies are usually carefully curated, so taxonomic and functional assignments based on marker genes are likely to show gains in both accuracy and speed over methods that analyze input sequences less discriminately. However, if the sequencing was not specially targeted [11], reads that contain marker genes may only account for a small percentage of a metagenomic sample.  

I think I will just leave these highlighted sections uncommented upon and leave it to people to imagine what I don’t like about them .. for now.

Anyway – again – the paper is worth checking out.  And if you want to know more about methods used for classifying sequences see this Mendeley collection which focuses on metagenomic analysis but has many additional papers on top of the ones discussed in this paper.

Interesting new paper: "Proving universal common ancestry with similar sequences"

Just discovered an interesting paper by Leonardo de Oliveira Martins and David Posada.  It is titled “Proving universal common ancestry with similar sequences.”  It relates to a paper by Douglas Theobald: “A formal test of the theory of universal common ancestry. Nature 2010; 465:219-22.” Although the latter paper is not openly available the more recent one is.  


The new paper is worth a look.  Not sure about the Theobald one as I do not have access from home.


Am hoping Leonardo writes more about this in his blog: Bayesian Procedures in Biology ….

Candidate Sequencing Organism – TDU (Microbacterium oxydans)

Micrococcus luteus wasn’t interesting enough to warrant further analysis, so I have picked another organism, TDU, for which to begin constructing a genomic library.  It appears to be within the Microbacterium genus, and shows an identical match to the species oxydans.  We actually isolated several Microbacterium colonies throughout this project from different sources, so I had a number of samples from which to choose before moving forward.

A phylogeny of the different Microbacterium samples we isolated was built by David Coil, to help me visualize how similar the samples are to the published genome. There is one completed and published genome in the Microbacterium genus, for the species testaceum, so the goal of the tree was to help me pick from the most divergent organisms to minimize the chances of a duplicate publication of the same organism’s genome. This tree compares the Microbacterium isolates we found with the published Microbacterium testaceum genome, with the most divergent organisms appearing to the left. UPDATE: Two outgroups have been added to further illustrate the degree of divergence.

M. testaceum 16S sequence_alignment_tree

The three samples I picked as the best candidates were AV2, TDU, and TFU (TJU was a difficult and sloppy process to isolate, so I played it safe and avoided it altogether). AV2 had very, very low concentrations of DNA in the genomic preparation, so the sample was discarded.  TDU and TFU both contained high levels of genomic DNA in their genomic preparations, so both were still equally viable as candidates. When I checked the glycerol stocks of both organisms on plates, however, TFU appeared to have slight contamination (which is really bad, considering these stocks are our last resource for obtaining pure samples of these organisms). This confirmed TDU as the Microbacterium oxydans sample for which I will begin constructing a library.

Currently, the dilution streak of TDU is incubating at 37 degrees C, and tomorrow I will begin the process of confirming the glycerol stock and begin the tagmentation reactions for the genomic library.

Today at #UCDavis Luca Comai “Genome-wide discovery of mutations in rice through exome capture & sequencing”

Genetics Seminar

“Genome-wide discovery of mutations in rice through exome capture and sequencing”

Speaker: Dr. Luca Comai

UC Davis | Plant Biology and Genome Center

Monday, May 14th, 2012

4:10 PM

1022 Life Sciences

__________

Lab meeting. May 16th 2012

Double feature at the Eisen lab meeting this week.

  • Anders Norman, a microbiologist from UC Berkeley, will be telling us about Tracing Host-Plasmid Dynamics in the Deeply Sequenced Acid Mine Drainage System.
  • Lea Benedicte Skov Hansen will also be presenting her work.

  • We will be meeting from 1:30 to 3:30 in the genome center in room 5206.

    BAY AREA BIOSYSTEMATISTS (BABS) MEETING 5/22

    Bay Area Biosystematists (BABS) Meeting

    Tuesday evening, 22 May 2012

    at UC Davis, 1022 Life Sciences Building

    “PHYLOGENOMICS AND SYSTEMATICS”

    The genomics era holds great promise (and challenge) to systematics. There is the prospect of generating sequence data that will provide unprecedented resolution of phylogenetic relationships across the Tree of Life, and a much improved understanding of the tempo and mode of evolution. Join us for two talks on phylogenomics, along with plenty of discussion, leavened by pizza and beer.

    Featuring presentations by…

    HOLLY BIK, Postdoctoral Researcher, Eisen Lab, UC Davis Genome Center

    “Assembling multi-species genomic data”

    and…

    BASTIEN BOUSSAU, Postdoctoral Fellow, Huelsenbeck Lab, UC Berkeley

    “Methods of phylogenetic inference for genome-scale data sets”

    Schedule and venue:

    5:30 pm: social gathering with beverages (beer and soft drinks) and informal

    pizza dinner: cost ca. $10, to be collected at door, 1022 Life Sciences, UC Davis campus.

    7:00 – 9:00 pm: talks, followed by discussion, in same room.

    Reservations required for beverages and dinner (but not the talk). Please email reservations to your host, Phil Ward: psward by Sunday, May 20

    For a map of UC Davis campus and Life Sciences Building:

    http://campusmap.ucdavis.edu/?b=97

    Parking is available in the West Entry Parking Structure, immediately west of Life Sciences. If coming from the Bay Area take the Hwy. 113 exit off I-80, and then the first exit off Hwy 113, which is Hutchison Drive. This will bring you directly to the parking garage.

    All are welcome, members or not. If you want to join the Biosystematists, sign up for our mailing list at:

    https://calmail.berkeley.edu/manage/list/listinfo/babs-l@lists.berkeley.edu

    Mini post: Microbial forensics

    A few months old here but there is a very interesting post from the Science Media Centre in New Zealand: Science Media Centre: Microbes in soil could help fight crime.  The post describes attempts to use microbes in soil as part of forensic activities.  This relates in many ways to my call for a “Field Guide to Microbes”.

    I have been interested in microbial forensics for many years since I worked at TIGR on part of the project to study anthrax genomes.  For those interested in microbe-related forensic activities I have created a Mendeley collection of references on the topic.

    http://www.mendeley.com/groups/1147121/_/widget/29/5/

    Oh the irony – new #OpenAccess #PLoSOne paper on Research Blogs doesn’t share data behind analyses.

    Interesting new paper: PLoS ONE: Research Blogs and the Discussion of Scholarly Information. All about the new world of science blogging.  Much of the context here relates to openness.  Yet as far as I can tell, the data collected that make up the meat of the analyses in the paper are not shared.  Uggh.

    Is there something I am missing here? Shouldn’t a prerequisite of publishing this kind of paper be sharing the information / data used in the analyses?  Shouldn’t that be released with the paper?

    Definitely time to start “Open Data Watch” where people have a place to complain about lack of open availability of data behind papers (I came up with the name as a mimic of Ivan Oransky’s diverse watch sites like Retraction Watch).  Originally in thinking about doing this I had been thinking about genomic data.  But I am sure this is a problem in other areas.  Consider paleontology, where openness to fossils and other samples is, well, not as common as it should be.  It is not that hard anymore to find a place to share one’s data.  With places like Data Dryad and Biotorrents and FigShare and Merritt and 100s of others it is really inexcusable not to share the data behind a paper in most cases.  Certainly, in some cases there may be privacy issues but that is not the case here (I think) and not an issue in most cases.

    Come on people.  If scientific papers are to be reproducible and testable, you need to give people access to the data you used. ResearchBlogging.org Shema, H., Bar-Ilan, J., & Thelwall, M. (2012). Research Blogs and the Discussion of Scholarly Information PLoS ONE, 7 (5) DOI: 10.1371/journal.pone.0035869

    John Roth seminar “Does RecA activity PREVENT chromosome rearrangements?” 5/14

    MIC 275 Rec Repair Club

    Monday May 14, 2012
    LS 1022
    10 AM

    John Roth:
    Mechanisms of duplication formation:
    Does RecA activity PREVENT chromosome rearrangements?

    ‘Danger and Evolution in the Twilight Zone’: Guest post by Randen Patterson and Gaurav Bhardwaj

    Figure 1. PHYRN concept and work flow.

    ‘Danger and Evolution in the twilight zone’

    I have been communicating with Randen Patterson on and off over the last five years or so about his efforts to try and study the evolution of gene families when the sequence similarity in the gene family is so low that making multiple sequence alignments is very difficult.  Recently, Randen moved to UC Davis so I have been talking / emailing with him more and more about this issue.  Of note, Randen has a new paper in PLoS One about this topic: Bhardwaj G, Ko KD, Hong Y, Zhang Z, Ho NL, et al. (2012) PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences. PLoS ONE 7(4): e34261. doi:10.1371/journal.pone.0034261.

    Figure 8. Model for the Evolution of the DANGER Superfamily.

    I invited Randen and the first author Gaurav Bhardwaj to do a guest post here providing some of the story behind their paper for my ongoing series on this topic.  I note – if you have published an open access paper on some topic related to this blog I would love to have a guest post from you too.   I also note – I personally love the fact that they used the “DANGER” family as an example to test their method.

    Here is their guest post:

    A fundamental problem for phylogenetic inference in the “twilight zone” (<25% pairwise identity), let alone the “midnight zone” (<12% pairwise identity), is the inability to accurately assign evolutionary relationships at these levels of divergence with statistical confidence. This lack of resolution arises from difficulties in separating the phylogenetic signal from the random noise at these levels of divergence. This obviously and ultimately stymies all attempts to truly resolve the Tree of Life. Since most attempts at phylogenetic inference in the twilight/midnight zone have relied on MSA, and with no clear answer on the best phylogenetic methods to resolve protein families in the twilight/midnight zone, we have presented the rest of this blog post as two questions representative of these problems.  
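As a rough illustration of the thresholds mentioned above (this sketch is not from the guest post or the PHYRN paper), percent pairwise identity for two already-aligned sequences could be computed and binned as follows. The function names are hypothetical, and the denominator convention used here (columns where neither sequence has a gap) is an assumption; other conventions exist and give somewhat different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must come from the same alignment (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def divergence_zone(identity):
    """Bin a percent identity using the thresholds quoted in the post."""
    if identity < 12:
        return "midnight zone"
    if identity < 25:
        return "twilight zone"
    return "above the twilight zone"


if __name__ == "__main__":
    pid = percent_identity("ACDE-FG", "ACDQWFG")
    print(f"{pid:.1f}% identity: {divergence_zone(pid)}")
```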

    Question1: Is MSA required for accurate phylogenetic inference? 

    Our Opinion: MSA is an excellent tool for inference from conserved data sets, but it has been shown by others and by us that the quality of MSA degrades rapidly in the twilight zone. Further, the quest for an optimal MSA becomes increasingly difficult as the number of taxa under study increases. Although the quality of MSA methods has improved in the last two decades, we have not made significant improvements towards overcoming these problems. Multiple groups have also designed alignment-free methods (see Hohl and Ragan, Syst. Biol. 2007), but so far none of these methods has been able to provide better phylogenetic accuracy than MSA+ML methods. We recently published a manuscript in PLoS One entitled “PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences” introducing a hybrid profile-based method. Our approach focuses on measuring phylogenetic signal from homologous biological patterns (functional domains, structural folds, etc), and their subsequent amplification and encoding as a phylogenetic profile. Further, we adopt a distance estimation algorithm that is alignment-free, and thus bypasses the need for an optimal MSA. Our benchmarking studies with synthetic (from ROSE and Seqgen) and biological datasets show that PHYRN outperforms other traditional methods (distance, parsimony and Maximum Likelihood), and provides significantly accurate phylogenies even in data sets exhibiting ~8% average pairwise identity. While this still needs to be evaluated in other simulations (varying tree shapes, rates, models), we are convinced that these types of methods do work and deserve further exploration. 
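To give a flavor of what "alignment-free" can mean in general, here is a toy sketch of a k-mer based distance between two sequences. This is explicitly not the PHYRN algorithm, which builds profile-based encodings as described in the paper. It is just one simple, generic alignment-free measure (cosine distance between k-mer count vectors), and the function names and default k are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_cosine_distance(seq_a, seq_b, k=2):
    """1 - cosine similarity of the two k-mer count vectors.

    No alignment is computed; only k-mer composition is compared.
    Returns 1.0 if either sequence is shorter than k.
    """
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    dot = sum(a[m] * b[m] for m in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up peptide fragments, for illustration only.
    print(kmer_cosine_distance("MKVLAAGIV", "MKVLSAGLV"))
    print(kmer_cosine_distance("MKVLAAGIV", "PPPPPPPPP"))
```

A matrix of such pairwise distances could then be fed to any distance-based tree-building method, though, as the authors note, simple alignment-free measures have generally not matched MSA+ML accuracy.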

    Question 2: How can we as a field critically and fairly evaluate phylogenetic methods? 

    Our Opinion: A similar problem plagued the field of structural biology, whereby there were multiple methods for structural predictions, but no clear way of standardizing or evaluating their performance.  An additional problem that applies to phylogenetic inference is that, unlike crystal structures of proteins, phylogenies do not have a corresponding “answer” that can be obtained.  Synthetic data sets have tried to answer this question to a certain extent by simulating protein evolution and providing true evolutionary histories that can be used for benchmarking.  However, these simulations cannot truly replicate biological evolution (e.g. indel distribution, translocations, biologically relevant birth-death models, etc). In our opinion, we need a CASP-like model (the solution adopted by our friends in computational structural biology), where the same data sets (with the true evolutionary history known only to the organizers) are analyzed by all the research groups, and the inferred phylogenies are then submitted to the organizers for critical evaluation. To convert this thought to reality, we hereby announce CAPE (Critical Assessment of Protein Evolution) for Summer 20132. We are still in pre-production stages, and we welcome any suggestions, comments and inputs about data sets, scoring and evaluating methods.   

    ResearchBlogging.org Bhardwaj, G., Ko, K., Hong, Y., Zhang, Z., Ho, N., Chintapalli, S., Kline, L., Gotlin, M., Hartranft, D., Patterson, M., Dave, F., Smith, E., Holmes, E., Patterson, R., & van Rossum, D. (2012). PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences PLoS ONE, 7 (4) DOI: 10.1371/journal.pone.0034261