Crosspost from microBEnet: Some interesting new papers on functional analysis of metagenomics

Crossposting from microBEnet:

Some new papers that may be of interest to people:

Story Behind the Paper: Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads (by Rogan Carr and Elhanan Borenstein)

Here is another post in my “Story Behind the Paper” series where I ask authors of open access papers to tell the story behind their paper.  This one comes from Rogan Carr and Elhanan Borenstein.  Note – this was crossposted at microBEnet.  If anyone out there has an open access paper for which you want to tell the story — let me know.

We’d like to first thank Jon for the opportunity to discuss our work in this forum. We recently published a study investigating direct functional annotation of short metagenomic reads that stemmed from protocol development for our lab. Jon invited us to write a blog post on the subject, and we thought it would be a great venue to discuss some practical applications of our work and to share with the research community the motivation for our study and how it came about.

Our lab, the Borenstein Lab at the University of Washington, is broadly interested in metabolic modeling of the human microbiome (see, for example, our Metagenomic Systems Biology approach) and in the development of novel computational methods for analyzing functional metagenomic data (see, for example, Metagenomic Deconvolution). In this capacity, we often perform large-scale analysis of publicly available metagenomic datasets as well as collaborate with experimental labs to analyze new metagenomic datasets, and accordingly we have developed extensive expertise in performing functional, community-level annotation of metagenomic samples. We focused primarily on protocols that derive functional profiles directly from short sequencing reads (e.g., by mapping the short reads to a collection of annotated genes), as such protocols provide gene abundance profiles that are relatively unbiased by species abundance in the sample or by the availability of closely related reference genomes. Such functional annotation protocols are extremely common in the literature and are essential when approaching metagenomics from a gene-centric point of view, where the goal is to describe the community as a whole.

However, when we began to design our in-house annotation pipeline, we pored over the literature and realized that each research group and each metagenomic study applied a slightly different approach to functional annotation. When we implemented and evaluated these methods in the lab, we also discovered that the functional profiles obtained by the various methods often differ significantly. When we discussed these findings with colleagues, some further expressed doubt that such short sequencing reads even contained enough information to map back unambiguously to the correct function. Perhaps the whole approach was wrong!

We therefore set out to develop a set of ‘best practices’ for our lab for metagenomic sequence annotation and to prove (or disprove) quantitatively that such direct functional annotation of short reads provides a valid functional representation of the sample. We specifically decided to pursue a large-scale study, performed as rigorously as possible, taking into account both the phylogeny of the microbes in the sample and the phylogenetic coverage of the database, as well as several technical aspects of sequencing like base-calling error and read length. We have found this evaluation approach and the results we obtained quite useful for designing our lab protocols, and thought it would be helpful to share them with the wider metagenomics and microbiome research community. The result is our recent paper in PLoS One, Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads.

The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life. The phylogenetic tree was obtained from Ciccarelli et al. Colored rings represent the recall for identifying reads originating from a KO gene using the top gene protocol. The 4 rings correspond to varying levels of database coverage. Specifically, the innermost ring illustrates the recall obtained when the strain from which the reads originated is included in the database, while the other 3 rings, respectively, correspond to cases where only genomes from the same species, genus, or more remote taxonomic relationships are present in the database. Entries where no data were available (for example, when the strain from which the reads originated was the only member of its species) are shaded gray. For one genome in each phylum, denoted by a black dot at the branch tip, every possible 101-bp read was generated for this analysis. For the remaining genomes, every 10th possible read was used. Blue bars represent the fraction of the genome's peptide genes associated with a KO; for reference, the values are shown for E. coli, B. thetaiotaomicron, and S. pneumoniae. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.

To perform a rigorous study of functional annotation, we needed a set of reads whose true annotations were known (a “ground truth”). In other words, we had to know the exact locus and the exact genome from which each sequencing read originated and the functional classification associated with this locus. We further wanted to have complete control over technical sources of error. To accomplish this, we chose to implement a simulation scheme, deriving a huge collection of sequence reads from fully sequenced, well annotated, and curated genomes. This scheme allowed us to have complete information about the origin of each read and allowed us to simulate various technical factors we were interested in. Moreover, simulating sequencing reads allowed us to systematically eliminate variations in annotation performance due to technological or biological effects that would typically be convoluted in an experimental setup. For a set of curated genomes, we settled on the KEGG database, as it contained a large collection of consistently functionally curated microbial genomes and it has been widely used in metagenomics for sample annotation. The KEGG hierarchy of KEGG Orthology groups (KOs), Modules, and Pathways could then serve as a common basis for comparative analysis. To control for phylogenetic bias in our results, we sampled broadly across 23 phyla and 89 genera in the bacterial and archaeal tree of life, using a randomly selected strain in KEGG for each tip of the tree from Ciccarelli et al. From each of the selected 170 strains, we generated either *every* possible contiguous sequence of a given length or (in some cases) every 10th contiguous sequence, using a sliding window approach. We additionally introduced various models to simulate sequencing errors. This large collection of reads (totaling ~16Gb) was then aligned to the KEGG genes database using a translated BLAST mapping.
To control for phylogenetic coverage of the database (the phylogenetic relationship of the database to the sequence being aligned) we also simulated mapping to many partial collections of genomes. We further used four common protocols from the literature to convert the obtained BLAST alignments to functional annotations. Comparing the resulting annotation of each read to the annotation of the gene from which it originated allowed us to systematically evaluate the accuracy of this annotation approach and to examine the effect of various factors, including read length, sequencing error, and phylogeny.
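
For readers who want a concrete picture of the sliding window step, here is a minimal sketch in Python. It is not the code used in the paper. The function name `sliding_window_reads` and its parameters are made up for illustration, and the sketch assumes a linear genome given as a plain string, a single strand, and no sequencing errors. The 101-bp default comes from the figure caption above; setting `step=10` corresponds to keeping every 10th possible read.

```python
from typing import Iterator, Tuple


def sliding_window_reads(
    genome: str, read_length: int = 101, step: int = 1
) -> Iterator[Tuple[int, str]]:
    """Yield (start, read) for every `step`-th contiguous window.

    Hypothetical helper: treats the genome as linear and single-stranded,
    which is a simplifying assumption for illustration only.
    """
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    # The last valid start is len(genome) - read_length; a genome shorter
    # than read_length yields no reads at all.
    for start in range(0, len(genome) - read_length + 1, step):
        yield start, genome[start:start + read_length]


# Toy example: a 10-bp "genome" and 4-bp reads.
toy_genome = "ACGTACGTAC"
all_reads = list(sliding_window_reads(toy_genome, read_length=4, step=1))
sparse_reads = list(sliding_window_reads(toy_genome, read_length=4, step=3))
```

Because each read carries its start coordinate, it can be traced back to the gene (and hence the KO) it overlaps, which is what makes a "ground truth" comparison possible.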

First and foremost, we confirmed that direct annotation of short reads indeed provides an overall accurate functional description of both individual reads and the sample as a whole. In other words, short reads appear to contain enough information to identify the functional annotation of the gene they originated from (although not necessarily the specific taxon of origin). Functions of individual reads were identified with high precision and recall, yet the recall was found to be clade dependent. As expected, recall and precision decreased with increasing phylogenetic distance to the reference database, but generally, having a representative of the genus in the reference database was sufficient to achieve a relatively high accuracy. We also found variability in the accuracy of identifying individual KOs, with KOs that are more variable in length or in copy number having lower recall. Our paper includes an abundance of data on these results, a detailed characterization of the mapping accuracy across different clades, and a description of the impact of additional properties (e.g., read length, sequencing error, etc.).
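
To make "precision" and "recall" concrete for per-read annotation, here is a small Python sketch. The definitions are an assumption for illustration and may differ in detail from those in the manuscript: precision is taken as the fraction of annotated reads whose predicted KO matches the true KO, and recall as the fraction of reads that truly originate from a KO gene and receive the correct KO. The function name and the toy read IDs are hypothetical.

```python
from typing import Dict, Optional, Tuple


def precision_recall(
    truth: Dict[str, Optional[str]],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Per-read precision and recall for KO assignment.

    `None` means "no KO": the read did not come from a KO gene (truth)
    or the protocol left it unannotated (predicted).
    """
    annotated = sum(1 for ko in predicted.values() if ko is not None)
    relevant = sum(1 for ko in truth.values() if ko is not None)
    correct = sum(
        1
        for read_id, ko in predicted.items()
        if ko is not None and truth.get(read_id) == ko
    )
    precision = correct / annotated if annotated else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Toy example with four simulated reads.
truth = {"r1": "K00001", "r2": "K00002", "r3": None, "r4": "K00003"}
predicted = {"r1": "K00001", "r2": "K00005", "r3": None, "r4": None}
precision, recall = precision_recall(truth, predicted)
```

In this toy case two reads are annotated and one is right, while three reads truly come from KO genes, so precision and recall diverge even on a tiny example.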

A principal component analysis of the pathway abundance profiles obtained for 15 HMP samples and by four different annotation protocols. HMP samples are numbered from 1 to 15 according to the list that appears in the Methods section of the manuscript. The different protocols are represented by color and shape. Note that two outlier protocols for sample 14 are not shown but were included in the PCA calculation. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.

Importantly, while the obtained functional annotations are in general representative of the true content of the sample, the exact protocol used to analyze the BLAST alignments and to assign functional annotation to each read could still dramatically affect the obtained profile. For example, in analyzing stool samples from the Human Microbiome Project, we found that each protocol left a consistent “fingerprint” on the resulting profile and that the variation introduced by the different protocols was on the same order of magnitude as biological variation across samples. Differences in annotation protocols are thus analogous to batch effects from variation in experimental procedures and should be carefully taken into consideration when designing the bioinformatic pipeline for a study.

Generally, however, we found that assigning each read the annotation of the top E-value hit (the ‘top gene’ protocol) had the highest precision for identifying the function from a sequencing read, and only slightly lower recall than methods enriching for known annotations (such as the commonly used ‘top 20 KOs’ protocol). Given our lab interests, this finding led us to adopt the ‘top gene’ protocol for functionally annotating metagenomic samples. Specifically, our work often requires high precision for annotating individual reads for model reconstruction (e.g., utilizing the presence and absence of individual genes) and the most accurate functional abundance profile for statistical method development. If your lab has similar interests, we would recommend this approach for your annotation pipelines. If, however, you have different or more specific needs, we encourage you to make use of the datasets we have published along with our paper to help you design your own solution. We would also be very happy to discuss such issues further with labs that are considering various approaches for functional annotation, to assess some of the factors that can impact downstream analyses, or to assist in such functional annotation efforts.
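
As a rough illustration of the ‘top gene’ idea, here is a Python sketch that keeps, for each read, the hit with the lowest E-value and assigns that gene's KO. This is a simplified reading of the description above, not the pipeline from the paper. The hit tuples, the `gene_to_ko` mapping, and the function name are hypothetical, and two choices are assumptions: ties keep the first hit seen, and a read whose top gene has no KO is left unannotated.

```python
from typing import Dict, Iterable, Optional, Tuple

# A hit is (read_id, gene_id, e_value), e.g. parsed from tabular BLAST output.
Hit = Tuple[str, str, float]


def top_gene_annotation(
    hits: Iterable[Hit], gene_to_ko: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Assign each read the KO of its single best (lowest E-value) hit."""
    best: Dict[str, Tuple[float, str]] = {}
    for read_id, gene_id, e_value in hits:
        # Strict "<" means ties keep the first hit encountered (an assumption).
        if read_id not in best or e_value < best[read_id][0]:
            best[read_id] = (e_value, gene_id)
    # Genes without a KO map to None, i.e. the read stays unannotated.
    return {
        read_id: gene_to_ko.get(gene_id)
        for read_id, (_, gene_id) in best.items()
    }


# Toy example: r1 has two hits, r2 hits a gene with no KO assignment.
hits = [
    ("r1", "geneA", 1e-5),
    ("r1", "geneB", 1e-20),
    ("r2", "geneC", 1e-3),
]
gene_to_ko = {"geneA": "K00001", "geneB": "K00002"}
annotations = top_gene_annotation(hits, gene_to_ko)
```

Counting the resulting KOs across all reads would then give a gene abundance profile for the sample; protocols that pool information from several hits differ mainly in how this per-read step is done.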

Wanted – opinions/details on online systems for annotation of genomes and metagenomes

Doing a little survey/snooping around.  Trying to compile a list of available online tools for annotating microbial genomes and metagenomes.  And I am also trying to get comments on what people think of the various tools.  There are some obvious candidates to think about

But given that there are certainly many many more out there I decided to post a request to Twitter and Google plus and got some responses.

And from Google Plus where I asked “Researching blog post on free/online microbial genome/metagenome annotation services – looking for examples beyond IMG & RAST “:

Important paper on annotation standards for bacterial/archaeal genomes – readying for the "data deluge"

Interesting paper in the journal “Standards in Genomic Sciences” that is worth checking out for anyone interested in genome sequencing and annotation. The paper is “Solving the Problem: Genome Annotation Standards before the Data Deluge” by William (aka Bill) Klimke et al.

It discusses the development of international annotation standards at NCBI (The National Center for Biotechnology Information) in collaboration with others. Note – the paper is Open Access.

Their abstract:

The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

The paper refers extensively to workshops held by NCBI on genome annotation and gives a link to a page from NCBI with additional information about these workshops.

Now – never mind the extensive use of the term prokaryote in the paper … the paper has got a wealth of information and tidbits worth checking out.

For example the paper has a nice table on annotation tools and databases and resources.

Among the other sections worth checking out
* Discussion of pseudogene annotation and identification
* Discussion of variation in structural annotation
* Evidence standards
* Functional annotation and naming guidelines

For anyone interested in annotating a genome – and more and more people are these days with the decrease in sequencing costs – this is a must read.

Hmm – How did I miss this paper discussing RFAM, Wikipedia & community annotation #cool

Edits for Wikipedia articles on RNA families. The cumulative number of edits since 1st January 2007 for the 733 Wikipedia articles that are associated with Rfam entries is shown in black. The total number of edits that were reverted or labeled as vandalism is shown in red. To mid-2010, there were just 106 of these. However, some reverted edits may have been well-intentioned but were deemed inappropriate for Wikipedia. From

Very interesting discussion in this paper on Rfam about community annotation: Rfam: Wikipedia, clans and the “decimal” release. In the paper the authors (including Alex Bateman, Sean Eddy, Paul Gardner and others) discuss the use of Wikipedia for Community Annotation of biological databases. They report:

Given our positive experiences, we can highly recommend other curation efforts turning to Wikipedia for their annotation

I am not sure how I missed this paper when it came out recently.  But it is definitely worth a look.  The last line hints at future developments

We look forward to working with the wider community to develop these new tools and techniques.

It seems that they have bought into the Wikipedia based annotation system as having enormous potential.  I generally agree though I am not sure how this is best done.   

Testing, testing – why we need more testing like this in genomic informatics & annotation methods

Just got an announcement regarding this challenge:

Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations | Automated Function Prediction 2011 July 15-16 2011, Vienna, Austria

Here is a description:

CAFA is a community-driven effort. We call upon computational function prediction groups to predict the function of a set of proteins whose true function is sequestered. At the meeting, we will reveal the functions, and discuss the predictions. The CAFA challenge goals are to foster a discussion between annotators, predictors and experimentalists about methodology as quality of functional predictions, as well as the methodology of assessing those predictions. Registration for CAFA starts July 15, 2010 and the CAFA challenge will take place September 15, 2010 through January 15, 2011. See here for more details on how you can enroll in CAFA.

This is near and dear to my heart as I have been working on methods to predict gene function from sequence for some 15 years now.  My first paper on this was in 1995 in which I showed that for genes in multigene families, phylogenetic trees of the gene family could help in predicting functions of uncharacterized members of the gene family.  More specifically, I suggested that the position of an uncharacterized gene in a gene tree relative to characterized genes could be used to predict its function.  I did this for one family in particular – the SNF2 family – but argued that it could be applied to other families.  (I think perhaps it was the first time someone had made this specific argument about using trees to predict function, but am not sure)

I then formalized this idea with a few papers (e.g., here and here) describing a “phylogenomic” approach to predicting function (alas, this is when I invented my first omics word).  And for many years since, I continued to work on functional prediction methods and continue to do so.  When I was at TIGR for eight years I did this both in my own research and helped others with their functional predictions.  I firmly believe that evolutionary approaches are critical in such functional prediction and have laid this out in a series of talks and papers (e.g., see this more recent one).

Anyway, enough about me.  I can argue all I want about how brilliant I am and about how evolutionary methods are the best approach.  But arguing is alas not science.  What we need are tests and experiments.  And that is where things like CAFA come in.  In CAFA one can test how well various functional prediction methods work.  And the people involved in CAFA (including organizers Iddo Friedberg, Michal Linial, and Predrag Radivojac and others such as Amos Bairoch, Sean Mooney, Patricia Babbitt, Steven Brenner, Christine Orengo and Burkhard Rost) are to be commended for putting this together because we do not have a lot of these activities and need more in all aspects of genomics (and metagenomics too).  Others have discussed doing tests of functional prediction methods before, but I am not sure if any have happened per se.

Have a favorite functional prediction method?  Enter it in the competition or give a talk on it.  And if you are feeling inspired, organize a similar activity in your area of science – testing is a good thing.

See also Iddo Friedberg’s post about this