Referring to 16S surveys as "metagenomics" is misleading and annoying #badomics #OmicMimicry

Aargh.  I am a big fan of ribosomal RNA-based surveys of microbial diversity.  Been doing them for 20+ years and still continue to – even though I have moved on to more genomic/metagenomic-based studies.  But it drives me crazy to see rRNA surveys now being called “metagenomics”.

Here are some examples of cases where rRNA surveys are referred to as metagenomics:

I found these examples in about five minutes of googling.  I am sure there are many many more.  
Why does this drive me crazy?  Because rRNA surveys focus on a single gene.  They are not genomic in any way.  Thus it is misleading to refer to rRNA surveys as “metagenomics”.  Why do people do this?  I think it is pretty simple.  Genomics and metagenomics are “hot” topics.  To call what one is doing “metagenomics” makes it sound special.  Well, just as adding an “omic” suffix does not make one’s work genomics, falsely labeling work as some kind of “omics” also does not make it genomics.
Enough of this.  If you are doing rRNA surveys of microbial communities – great – I love them.  But do not refer to this work as metagenomics.  If you do, you are being misleading, either accidentally or on purpose.    So I think I need a new category of #badomics – “Omic Mimicry” or something like that …
Note – this post was spurred on by a Twitter conversation – which is captured below (note – I am certain I have complained about this before but cannot find a record of it …)

Nice new memory-efficient metagenome assembly method from C. Titus Brown

Interesting new #OpenAccess PNAS paper from C. Titus Brown: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.  Of course, if you follow Titus on Twitter or his blog you would know about this already – not only has he posted about it, he also posted a preprint of the paper on arXiv in December.

Check out the press release from Michigan State.  Some good lines there like “Analyzing DNA data using traditional computing methods is like trying to eat a large pizza in a single bite.”

A key point in the paper: “The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory.” This is important because right now most assemblers for genome data use a ton of memory.
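To make the Bloom-filter idea concrete, here is a minimal toy sketch (not the authors’ actual implementation, which is the khmer software linked below): k-mers are hashed into a fixed bit array by several hash functions, so membership queries can give false positives but never false negatives, and memory use is fixed no matter how many k-mers are stored.

```python
import hashlib

class KmerBloomFilter:
    """Toy Bloom filter for DNA k-mers: a fixed bit array plus a few
    hash functions; queries may yield false positives, never false
    negatives."""

    def __init__(self, n_bits=2 ** 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes deterministic bit positions from salted SHA-256.
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

bf = KmerBloomFilter()
read = "ATGGCGTACGTTAGCTAGGCATCGATCG"
for km in kmers(read, 8):
    bf.add(km)

print(all(km in bf for km in kmers(read, 8)))  # prints True
```

The real trick in the paper is tuning the bit-array size against an acceptable false-positive rate, which is how they get down to ~4 bits per k-mer; this sketch just shows the data structure’s shape.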

Anyway the software behind the paper is available on GitHub here.  Assemble away.

Useful comparative analysis of sequence classification systems w/ a few questionable bits

There is a useful new publication just out: “A comparative evaluation of sequence classification programs” in BMC Bioinformatics, by Adam L. Bazinet and Michael P. Cummings.  In the paper the authors attempt a systematic comparison of tools for classifying DNA sequences according to the taxonomy of the organism from which they come.

I have been interested in such activities since, well, since 1989, when I started working in Colleen Cavanaugh’s lab at Harvard sequencing rRNA genes to do classification.  And I have known one of the authors, Michael Cummings, for almost as long.

Their abstract does a decent job of summing up what they did:

A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for “barcoding genes” like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis. 

We divided the very large number of programs that have been released in recent years for solving the sequence classification problem into three main categories based on the general algorithm they use to compare a query sequence against a database of sequences. We also evaluated the performance of the leading programs in each category on data sets whose taxonomic and functional composition is known. 

We found significant variability in classification accuracy, precision, and resource consumption of sequence classification programs when used to analyze various metagenomics data sets. However, we observe some general trends and patterns that will be useful to researchers who use sequence classification programs.

The three main categories of methods they identified are:

  • Programs that primarily utilize sequence similarity search
  • Programs that primarily utilize sequence composition models (like CompostBin from my lab)
  • Programs that primarily utilize phylogenetic methods (like AMPHORA & STAP from my lab)
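To illustrate the second category, here is a minimal sketch of composition-based classification (this is the general idea only, not CompostBin’s actual algorithm, which works on principal components of k-mer frequencies): represent each sequence by its normalized k-mer frequency vector and assign a query to the closest reference profile.

```python
from collections import Counter
from math import sqrt

def composition(seq, k=3):
    """Normalized k-mer frequency vector of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {km: c / total for km, c in counts.items()}

def distance(a, b):
    """Euclidean distance between two sparse frequency vectors."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def classify(read, references):
    """Assign a read to the reference whose composition is closest."""
    profiles = {name: composition(seq) for name, seq in references.items()}
    query = composition(read)
    return min(profiles, key=lambda name: distance(query, profiles[name]))

# Hypothetical reference sequences, just for illustration.
refs = {
    "AT-rich taxon": "ATATTAATAATTTATATAATTAATATATTAT" * 4,
    "GC-rich taxon": "GCGGCCGCGGCGCCGGCGGCCGCGCGGCGGC" * 4,
}
print(classify("ATATTATAATATTTAT", refs))  # prints AT-rich taxon
```

Similarity-search programs (the first category) instead align the query against a database, and phylogenetic programs (the third) place it into a reference tree; composition methods like this one need no alignment at all, which is their main speed advantage.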

The paper has some detailed discussion and comparison of the methods in each category.  They even made a tree of the methods:
Figure 1. Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes. From here.
In some ways – I love this figure.  Since, well, I love trees.  But in other ways I really really really do not like it.  I don’t like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees and not to simply cluster entities by their similarity) to cluster entities that do not have a phylogenetic history.  Why use neighbor-joining here?  What is the basis for using this method to cluster methods?  It is cute, sure.  But I don’t get it.  What do deep branches represent in this case?  It drives me a bit crazy when people throw a method designed to represent branching history at a situation where clustering by similarity is needed.  Similarly it drives me crazy when similarity based clustering methods are used when history is needed.
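For contrast, here is what I mean by clustering entities by similarity rather than by inferred history: a minimal single-linkage agglomerative clustering sketch over hypothetical program attributes (the attribute sets below are made up for illustration; the paper’s actual attributes are in its Figure 1).

```python
def jaccard_dist(a, b):
    """1 minus the Jaccard similarity between two attribute sets."""
    return 1 - len(a & b) / len(a | b)

def single_linkage(items, threshold):
    """Greedy single-linkage clustering: repeatedly merge the two
    closest clusters until the smallest inter-cluster distance
    exceeds the threshold."""
    clusters = [[name] for name in items]

    def dist(c1, c2):
        return min(jaccard_dist(items[x], items[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        if dist(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Hypothetical program attributes, just for illustration.
programs = {
    "blast-like-1": {"similarity-search", "nucleotide"},
    "blast-like-2": {"similarity-search", "protein"},
    "composition-1": {"composition-model", "kmer"},
    "phylo-1": {"phylogenetic", "marker-gene"},
}
print(single_linkage(programs, threshold=0.7))
```

A dendrogram from a method like this makes no claim about ancestry – branch depths just reflect attribute overlap – which is exactly the property a figure like theirs should have.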
Not to take away from the paper too much – this is definitely worth a read for those working on methods to classify sequences as well as for those using such methods.  They even go so far as to test various web servers (e.g., MG-RAST) and discuss the time it takes to get results.  They also test the methods for their precision and sensitivity.  Very useful bits of information here.
So – overall I like the paper.  But one other thing in here sticks in my craw.  The discussion of “marker genes.”  Below is some of the introductory text on the topic.  I have labeled some bits I do not like too much:

It is important to note that some supervised learning methods will only classify sequences that contain “marker genes”. Marker genes are ideally present in all organisms, and have a relatively high mutation rate that produces significant variation between species. The use of marker genes to classify organisms is commonly known as DNA barcoding. The 16S rRNA gene has been used to greatest effect for this purpose in the microbial world (green genes [6], RDP [7]). For animals, the mitochondrial COI gene is popular [8], and for plants the chloroplast genes rbcL and matK have been used [9]. Other strategies have been proposed, such as the use of protein-coding genes that are universal, occur only once per genome (as opposed to 16S rRNA genes that can vary in copy number), and are rarely horizontally transferred [10]. Marker gene databases and their constitutive multiple alignments and phylogenies are usually carefully curated, so taxonomic and functional assignments based on marker genes are likely to show gains in both accuracy and speed over methods that analyze input sequences less discriminately. However, if the sequencing was not specially targeted [11], reads that contain marker genes may only account for a small percentage of a metagenomic sample.  

I think I will just leave these highlighted sections uncommented upon and let people imagine what I don’t like about them … for now.

Anyway – again – the paper is worth checking out.  And if you want to know more about methods used for classifying sequences, see this Mendeley collection, which focuses on metagenomic analysis but has many additional papers on top of the ones discussed in this paper.

Phylogenetic analysis of metagenomic data – Mendeley group …

Just a little plug for a Mendeley reference collection I have been helping make on “Phylogenetic and related analyses of metagenomic data.”  If you want to know more about such studies you can find a growing list of publications at the group collection.

Phylogenetic and related analyses of metagenomic data is a group in Biological Sciences on Mendeley.

OMICS Driven Microbial Ecology …

Quick post here.  Just discovered a nice review paper by Suenaga on targeted metagenomics: “Targeted metagenomics: a high-resolution metagenomics approach for specific gene clusters in complex microbial communities” (Suenaga, 2011, Environmental Microbiology).

This “Special Issue” on “OMICS Driven Microbial Ecology” has a series of papers, all of which seem to be freely available, of potential interest to readers of this blog including:

and more
Oh, and a paper of mine (with Alex Worden and other members of her lab as well as multiple others)

PCR amplification and pyrosequencing of rpoB as a complement to rRNA

Figure 1. Number of OTUs as a function of fractional sequence difference (OTU cut-off) for the 16S rRNA marker gene (A) and the rpoB marker gene (B).

Interesting new paper in PLoS ONE: A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity

In the paper they test and use PCR amplification and pyrosequencing of the rpoB gene for studies of the diversity of bacteria.  Due to the lower conservation of rpoB than rRNA genes at the DNA level, they focused on proteobacteria.  It seems that with a little perseverance one can get PCR for protein-coding genes to work reasonably well even for fairly broad taxonomic groups (not totally new here, but I am not aware of too many papers doing this with pyrosequencing).  Anyway, the paper is worth a look.
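The OTU-versus-cutoff curves in their Figure 1 come from clustering reads at a given fractional sequence difference.  A minimal greedy OTU-picking sketch (the general idea behind tools like CD-HIT, not the authors’ actual pipeline, and assuming pre-aligned, equal-length reads) shows why the OTU count falls as the cutoff loosens:

```python
def frac_diff(a, b):
    """Fractional difference between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, cutoff):
    """Greedy OTU picking: each sequence joins the first OTU whose seed
    is within the cutoff, otherwise it seeds a new OTU."""
    seeds, otus = [], []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if frac_diff(s, seed) <= cutoff:
                otus[i].append(s)
                break
        else:
            seeds.append(s)
            otus.append([s])
    return otus

# Toy reads: two near-identical, two identical but distant from the rest.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTACGGGC", "TTTTACGGGC"]
for cutoff in (0.0, 0.1, 0.5):
    print(cutoff, len(greedy_otus(reads, cutoff)))
# prints 0.0 3 / 0.1 2 / 0.5 1
```

Real pipelines use alignment-based distances and smarter seed selection, but the monotone drop in OTU count with increasing cutoff is the same phenomenon plotted in the figure.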

Vos M, Quince C, Pijl AS, de Hollander M, Kowalchuk GA (2012) A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity. PLoS ONE 7(2): e30600. doi:10.1371/journal.pone.0030600

Cool paper from DerisiLab on viruses in unknown tropical febrile illnesses #metagenomics #viroarray

Quick post:

Figure 3. Circovirus-like NI sequence coverage and phylogeny.

Cool new paper from Joe DeRisi’s lab in PLoS Neglected Tropical Diseases: Virus Identification in Unknown Tropical Febrile Illness Cases Using Deep Sequencing

Full citation: Yozwiak NL, Skewes-Cox P, Stenglein MD, Balmaseda A, Harris E, et al. (2012) Virus Identification in Unknown Tropical Febrile Illness Cases Using Deep Sequencing. PLoS Negl Trop Dis 6(2): e1485. doi:10.1371/journal.pntd.0001485

They used a combination of a viral microarray and metagenomic sequencing to characterize viruses in various samples from patients with febrile illness.  And they found some semi-novel viruses in the samples.  Definitely worth a look.

Note – here are some other posts of mine about DeRisi:

See some follow up discussion on Google+ here.

Microbial metaomics discussion group this week: metatranscriptomics and biogeography

A visiting student in my lab, Lea Benedicte Skov Hansen, will be leading our “metaomics” discussion group this week.  We will be discussing a combination of metatranscriptomics and biogeography, and the papers of the week are:

Metatranscriptomics paper:

Microbial community gene expression in ocean surface waters. Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, Chisholm SW, Delong EF. Proc Natl Acad Sci U S A. 2008 Mar 11;105(10):3805-10. Epub 2008 Mar 3.

Some related papers of potential interest from DeLong

We are also discussing:

Drivers of bacterial beta-diversity depend on spatial scale. Martiny JB, Eisen JA, Penn K, Allison SD, Horner-Devine MC.
Proc Natl Acad Sci U S A. 108(19):7850-4.  (NOTE I am an author on this one – but the meat of the ideas/work was done by Jen Martiny, Claire Horner-Devine and others).

Related papers of possible interest by Jen Martiny and Claire Horner-Devine include:

Will let everyone know how the discussions go.  

Interesting new metagenomics paper w/ one big big big caveat – critical software not available

Very very strange.  There is an interesting new metagenomics paper that has come out in Science this week.  It is titled “Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota” and it is from the Armbrust lab at U. Washington.

One of the main points of this paper is that the lab has developed software that apparently can help assemble the complete genomes of organisms that are present in low abundance in a metagenomic sample.  At some point I will comment on the science in the paper (which seems very interesting), though as the paper is not Open Access I feel uncomfortable doing so since many of the readers of this blog will not be able to read it.

But something else relating to this paper is worth noting and it is disturbing to me.  In a Nature News story on the paper by Virginia Gewin there is some detail about the computational method used in the paper:

“He developed a computational method to break the stitched metagenome into chunks that could be separated into different types of organisms. He was then able to assemble the complete genome of Euryarchaeota, even though it was rare within the sample. He plans to release the software over the next six months.”

What?  It is imperative that software that is so critical to a publication be released in association with the paper.  It is really unacceptable for the authors to say “we developed a novel computational method” and then to say “we will make it available in six months”.  I am hoping the authors change their mind on this but I find it disturbing that Science would allow publication of a paper highlighting a new method and then not have the method be available.  If the methods and results in a paper are not usable how can one test/reproduce the work?

Draft blog post cleanup #2: Metagenomics meets animals

OK – I am cleaning out my draft blog post list.  I start many posts and don’t finish them and then they sit in the draft section of blogger.  Well, I am going to try to clean some of that up by writing some mini posts.  Here is #2:

Saw an interesting story on GenomeWeb’s Daily Scan: “Denizens of the Deep.”  I have not been able to get the original article yet, but it seems that what they have done can basically be considered metagenomics for animals.  They collected sloughed-off cells and other material from a lake and surveyed it for animal DNA.  This seems like a very cool derivative of metagenomic approaches and has enormous potential.  But alas, I never got around to getting access to the paper (“Monitoring endangered freshwater biodiversity using environmental DNA”), so this will have to stay as a mini post.  Damn non-open-access journals …