Painful, kind of hilarious, typesetting error in a #PLOSOne paper from my lab – should I try to get it fixed?

Saw this tweet:

So I went and found the paper.

Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033

And discovered what he was pointing to

Then I looked at the Pubmed Central version of the paper and it was the same.  So I wnet and found the arXiv version of the paper and it looked correct.

So apparently, PLOS One somehow replaced “nonmember” with

Brutal.  And even worse, this may have been there all along and I missed it.  So I responded:

// And then Michael Hall pointed out another mistake.




Aaargh.  And funny too.  So now the question I guess is – should I fix it? And if so, how do I do that?


Story Behind the Paper: Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads (by Rogan Carr and Elhanan Borenstein)

Here is another post in my “Story Behind the Paper” series where I ask authors of open access papers to tell the story behind their paper.  This one comes from Rogan Carr and Elhanan Borenstein.  Note – this was crossposted at microBEnet.  If anyone out there has an open access paper for which you want to tell the story — let me know.

We’d like to first thank Jon for the opportunity to discuss our work in this forum. We recently published a study investigating direct functional annotation of short metagenomic reads that stemmed from protocol development for our lab. Jon invited us to write a blog post on the subject, and we thought it would be a great venue to discuss some practical applications of our work and to share with the research community the motivation for our study and how it came about.

Our lab, the Borenstein Lab at the University of Washington, is broadly interested in metabolic modeling of the human microbiome (see, for example our Metagenomic Systems Biology approach) and in the development of novel computational methods for analyzing functional metagenomic data (see, for example, Metagenomic Deconvolution). In this capacity, we often perform large-scale analysis of publicly available metagenomic datasets as well as collaborate with experimental labs to analyze new metagenomic datasets, and accordingly we have developed extensive expertise in performing functional, community-level annotation of metagenomic samples. We focused primarily on protocols that derive functional profiles directly from short sequencing reads (e.g., by mapping the short reads to a collection of annotated genes), as such protocols provide gene abundance profiles that are relatively unbiased by species abundance in the sample or by the availability of closely-related reference genomes. Such functional annotation protocols are extremely common in the literature and are essential when approaching metagenomics from a gene-centric point of view, where the goal is to describe the community as a whole.

However, when we began to design our in-house annotation pipeline, we pored over the literature and realized that each research group and each metagenomic study applied a slightly different approach to functional annotation. When we implemented and evaluated these methods in the lab, we also discovered that the functional profiles obtained by the various methods often differ significantly. Discussing these findings with colleagues, some further expressed doubt that that such short sequencing reads even contained enough information to map back unambiguously to the correct function. Perhaps the whole approach was wrong!

We therefore set out to develop a set of ‘best practices’ for our lab for metagenomic sequence annotation and to prove (or disprove) quantitatively that such direct functional annotation of short reads provides a valid functional representation of the sample. We specifically decided to pursue a large-scale study, performed as rigorously as possible, taking into account both the phylogeny of the microbes in the sample and the phylogenetic coverage of the database, as well as several technical aspects of sequencing like base-calling error and read length. We have found this evaluation approach and the results we obtained quite useful for designing our lab protocols, and thought it would be helpful to share them with the wider metagenomics and microbiome research community. The result is our recent paper in PLoS One, Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads.

The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life. The phylogenetic tree was obtained from Ciccarelli et al. Colored rings represent the recall for identifying reads originating from a KO gene using the top gene protocol. The 4 rings correspond to varying levels of database coverage. Specifically, the innermost ring illustrates the recall obtained when the strain from which the reads originated is included in the database, while the other 3 rings, respectively, correspond to cases where only genomes from the same species, genus, or more remote taxonomic relationships are present in the database. Entries where no data were available (for example, when the strain from which the reads originated was the only member of its species) are shaded gray. For one genome in each phylum, denoted by a black dot at the branch tip, every possible 101-bp read was generated for this analysis. For the remaining genomes, every 10th possible read was used. Blue bars represent the fraction of the genome's peptide genes associated with a KO; for reference, the values are shown for E. coli, B. thetaiotaomicron, and S. Pneumoniae. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.
The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life using the ‘top gene’ protocol. See the manuscript for full details. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776 

To perform a rigorous study of functional annotation, we needed a set of reads whose true annotations were known (a “ground truth”). In other words, we had to know the exact locus and the exact genome from which each sequencing read originated and the functional classification associated with this locus. We further wanted to have complete control over technical sources of error. To accomplish this, we chose to implement a simulation scheme, deriving a huge collection of sequence reads from fully sequenced, well annotated, and curated genomes. This schemed allowed us to have complete information about the origin of each read and allowed us to simulate various technical factors we were interested in. Moreover, simulating sequencing reads allowed us to systematically eliminate variations in annotation performance due to technological or biological effects that would typically be convoluted in an experimental setup. For a set of curated genomes, we settled on the KEGG database, as it contained a large collection of consistently functionally curated microbial genomes and it has been widely used in metagenomics for sample annotation. The KEGG hierarchy of KEGG Orthology groups (KOs), Modules, and Pathways could then serve as a common basis for comparative analysis. To control for phylogenetic bias in our results, we sampled broadly across 23 phyla and 89 genera in the bacterial and archaeal tree of life, using a randomly selected strain in KEGG for each tip of the tree from Ciccarelli et al. From each of the selected 170 strains, we generated either *every* possible contiguous sequence of a given length or (in some cases) every 10th contiguous sequence, using a sliding window approach. We additionally introduced various models to simulate sequencing errors. This large collection of reads (totaling ~16Gb) were then aligned to the KEGG genes database using a translated BLAST mapping. To control for phylogenetic coverage of the database (the phylogenetic relationship of the database to the sequence being aligned) we also simulated mapping to many partial collections of genomes. We further used four common protocols from the literature to convert the obtained BLAST alignments to functional annotations. Comparing the resulting annotation of each read to the annotation of the gene from which it originated allowed us to systematically evaluate the accuracy of this annotation approach and to examine the effect of various factors, including read length, sequencing error, and phylogeny.

First and foremost, we confirmed that direct annotation of short reads indeed provides an overall accurate functional description of both individual reads and the sample as a whole. In other words, short reads appear to contain enough information to identify the functional annotation of the gene they originated from (although, not necessarily the specific taxa of origin). Functions of individual reads were identified with high precision and recall, yet the recall was found to be clade dependent. As expected, recall and precision decreased with increasing phylogenetic distance to the reference database, but generally, having a representative of the genus in the reference database was sufficient to achieve a relatively high accuracy. We also found variability in the accuracy of identifying individual KOs, with KOs that are more variable in length or in copy number having lower recall. Our paper includes abundance of data on these results, a detailed characterization of the mapping accuracy across different clades, and a description of the impact of additional properties (e.g., read length, sequencing error, etc.).

A principal component analysis of the pathway abundance profiles obtained for 15 HMP samples and by four different annotation protocols. HMP samples are numbered from 1 to 15 according to the list that appears in the Methods section of the manuscript. The different protocols are represented by color and shape. Note that two outlier protocols for sample 14 are not shown but were included in the PCA calculation. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.
A principal component analysis of the pathway abundance profiles obtained for 15 HMP samples and by four different annotation protocols.The different protocols are represented by color and shape. See the manuscript for full details. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776 

Importantly, while the obtained functional annotations are in general representative of the true content of the sample, the exact protocol used to analyze the BLAST alignments and to assign functional annotation to each read could still dramatically affect the obtained profile. For example, in analyzing stool samples from the Human Microbiome Project, we found that each protocol left a consistent “fingerprint” on the resulting profile and that the variation introduced by the different protocols was on the same order of magnitude as biological variation across samples. Differences in annotation protocols are thus analogous to batch effects from variation in experimental procedures and should be carefully taken into consideration when designing the bioinformatic pipeline for a study.

Generally, however, we found that assigning each read with the annotation of the top E-value hit (the ‘top gene’ protocol) had the highest precision for identifying the function from a sequencing read, and only slightly lower recall than methods enriching for known annotations (such as the commonly used ‘top 20 KOs’ protocol). Given our lab interests, this finding led us to adopt the ‘top gene’ protocol for functionally annotating metagenomic samples. Specifically, our work often requires high precision for annotating individual reads for model reconstruction (e.g., utilizing the presence and absence of individual genes) and the most accurate functional abundance profile for statistical method development. If your lab has similar interests, we would recommend this approach for your annotation pipelines. If however, you have different or more specific needs, we encourage you to make use of the datasets we have published along with our paper to help you design your own solution. We would also be very happy to discuss such issues further with labs that are considering various approaches for functional annotation, to assess some of the factors that can impact downstream analyses, or to assist in such functional annotation efforts.

Guest post by Jay Kaufman: A Bad Taste That Keeps Not Getting Any Better….

Guest post by Jay Kaufman.  Jay and I have been having some email discussions about a paper in PLOS One.  I offered to let him write a guest post to my blog about his concerns.


Jonathan Eisen already posted on this blog about a PLoS ONE paper by Mason et.. published on 23 October 2013.  And he posted related comments on the PLoS ONE website.  I also commented at this site, in reference to his comments and the authors’ response. The purpose of my comments here are just to review those concerns and comment additionally on the PLos ONE response and what this means for the journal’s publication model and the progress of science.

The paper by Mason and colleagues analyzes data on 48 people in each of 4 self-identified ethnic groups (African American, Caucasian, Chinese, and Latino). These study subjects are apparently volunteers, and the paper only states that they are non-smokers over 18 years old who are free of a list of diagnosed diseases and who have not recently had their teeth cleaned. Based on the text of the published paper, there is no consideration of their age, diet, social class, or even gender. The authors culture bacterial species from the study subjects and process the data through an algorithm that maximizes the prediction of racial group membership based on these measured data.

The prediction is moderately successful, but this could result from any number of unsurprising reasons. For example, if alcohol consumption affects some particular bacterial species, and whites drink more than Asians in central Ohio, then whatever species is diminished by alcohol exposure would help predict that a sample was from a white rather than from an Asian volunteer. And likewise for any of a million possible lifestyle, social class and demographic differences.  In fact, this is a general problem with data mining exercises that Lazer et al describe in the current issue of Science.

The authors provide no information about how balanced this sample is with respect to any of these variables. Maybe the 48 Hispanics are younger than the 48 whites on average, or have more tooth decay or eat more refined sugar or any of a million other possibilities. The fact that these countless potentially imbalanced factors get represented in the oral bio-environment hardly seems surprising, and the fact that these behaviors and exposures might be differential by race is an observation that is completely trivial from a sociological perspective.

My concern here, however, the authors assert in the published text that these differences do not arise from any of these myriad environmental factors, but from some innate genetic characteristics of the groups. In the Discussion section on page 3 they state that “ethnicity exerts a selection pressure on the oral microbiome, and…this selection pressure is genetic rather than environmental, since the two ethnicities that shared a common food, nutritional and lifestyle heritage (Caucasians and African Americans) demonstrated significant microbial divergence.” Here is a remarkable statement, that Caucasians and African Americans experience no differential dietary or lifestyle factors. It is directly contradicted by thousands of published papers in sociology, epidemiology and anthropology that document these differences for reasons of culture, geographic origin, social class and discrimination.

Jonathan Eisen’s post directly confronted the authors on this point, and they responded with the following explanation:

“Subjects were selected based on extensive questionnaire surveys and clinical examinations to ensure homogeneity. These questionnaires evaluated educational level, socio-economic status, diet and nutritional history, systemic health status, oral hygiene habits and dental visits, among other things.”

This is surely an important statement about the research design, but the problem is that it appears nowhere in the peer-reviewed text of the published paper. What exactly do the authors mean when they insist that the study subjects were perfectly balanced on factors such as socio-economic status and nutritional history? These complex social and lifestyle variables are notoriously difficult to define and measure. While the authors describe the laboratory techniques in baroque detail, they do not even mention in the published paper that they measured these factors, let alone how these variables were defined and considered in the analysis. This represents a profound limitation for the reader in assessing the validity of these measures and adjustments, and therefore the adequacy of the claimed “homogeneity”. The complete omission of these crucial aspects of the analysis in the paper prevents the reader from investing much confidence in the boldly stated claim that observed differences are “genetic rather than environmental” in origin.

I expressed these concerns in my own post at the journal website on 14 November 2013, but the authors did not respond.  Therefore, at Jonathan’s suggestion, I addressed this concern to the PLoS ONE editors in an e-mail on 22 November 2013. I got passed along from one editor to another, and finally I got a very nice response from Elizabeth Silva on 4 December 2013. She wrote:

I wanted to let you know that I am discussing this article and your concerns with both the Academic Editor and the authors, as well as with my colleagues. We take such concerns very seriously and will ensure that appropriate measures are taken to correct any errors or discrepancies. 

Then I waited.  After 2 months I had heard nothing, so I wrote to Dr. Silva again asking for any word on progress, but received no reply.  So I waited another month.

On 4 March it had been 3 months since the note from PLoS ONE promising appropriate measures to correct any errors or discrepancies, so I wrote again, this time a bit more insistently.  This did message did finally generate a quick and reassuring response from Dr. Silva:

I really am very sorry for the extended delay in replying to you, and for my neglect in providing you with an update. Following your correspondence I contacted the authors to ask them for additional information relating to their statement that they corrected for confounding factors, and details of these methods. They promptly replied with a table of details of the baseline variables that they corrected for, and that they described in the comment on their article (see attached), as well as an additional correction to one of their figure legends. I then contacted the Academic Editor, Dr. Indranil Biswas with the full details of your concerns, as well as the table sent by the authors and the correction they requested for their figure legend. We asked Dr. Biswas to revisit the manuscript, in light of this new information, and he has informed us that he feels the conclusions of the manuscript are sound. We will now work with the authors to draft and issue a formal correction to the published article to update the methods to include the table, and to amend the figure legend in question.

The table that Dr. Silva forwarded displayed a list of variables and a p-value for some kind of test between the values in the 4 race groups. The test is not specified (t-test? chi-square test?) but presumably it is for any difference in means or proportions between the 4 groups.  Most of the p-values are large, indicating little evidence for any difference between the groups in income, age, education, or frequency of tooth-brushing, etc.  Based on this table, the populations differed only in their diets, which were characterized as “Asian Diet”, “Hispanic Diet” and “American Diet”.  Not unsurprisingly, the Asians were significantly more likely to report an “Asian Diet” and the Hispanics were significantly more likely to report a “Hispanic Diet”.  The Blacks and Whites had similar reported consumption of the “American Diet”, which presumably was the basis for the authors’ assertion that these groups have identical social environments.

To date, there has been no correction made to the Mason et al paper at the PLoS ONE website.  Therefore it is perhaps somewhat premature to speculate on how the authors will address the concern voiced by Jonathan Eisen’s posted comment and blog post that balance across a handful of measured covariates does not in any way imply balance across all relevant factors except for genetics. Indeed, it has long been argued in the epidemiology literature that one cannot make indirect inferences about genes by measuring and adjusting for a few environmental exposures and attributing all remaining differences to genes.  The argument that Blacks and Whites in Ohio experience identical environments is clearly false, even if a handful of measured covariates are not significantly different in their small convenience sample, the exact origin of which is still obscure.

There are many observations that can be made from this episode. I offer just a few:

  1. These authors are assiduous in describing their lab techniques, but regarding the study design and analysis they are quite cavalier.  Presumably the reviewers were not population scientists, and so they failed to point out these embarrassing flaws. This raises the question of whether a multidisciplinary journal such as PLoS ONE has the relevant expertise to screen out scientifically invalid papers. The fact that Dr. Silva suggested that the authors’ table of covariates and p-values solves the issue demonstrates a wide gap of understanding.  Specialty journals that handle a narrow disciplinary range are not faced with this kind of crisis of competence.
  2. PLoS ONE is such a large operation with so many papers, that quality control seems to suffer. These beleaguered editors are responsible for an enormous publishing volume. Has quantity overwhelmed quality to the extent that gross errors of logic slip through? Months later, the Mason paper has been accessed thousands of times and generated a great deal of media attention, and yet no correction or erratum has appeared, despite the fact that the authors freely admit that the methods in their published paper are not accurate. 
  3. The publishing model gives PLoS ONE a big incentive (almost $2000) to accept a paper, but once it is published, little incentive to correct or withdraw it. 

Sadly, this is not an isolated example.  This week, PLoS ONE published a paper by Wikoff et al which makes a similar logical gaffe about observed racial difference proving a genetic difference.  We could post a comment online, but it seems that nobody (neither the authors nor the editors) has much time to spend monitoring such comments, nor much incentive to care about them.  The authors have their publication, the journal has its $2000, and another tiny piece of horrific misinformation has been released into the world.  The basic philosophy of PLoS ONE is to reduce the gate-keeper role of scientific publication. I am starting to become convinced that a little gate-keeping is not such a bad idea.

Bad Ome-like word of the week: symbiome

Well I got pointed to this paper: Transgenerational Transmission of the Glossina pallidipes Hytrosavirus Depends on the Presence of a Functional Symbiome

And as many might guess – the word “symbiome” did not sit well with me.  Alas, they don’t define it in the paper.  So I can’t really quibble with their definition.  But I did find some other stuff out there that, well, at least helps see how other people are using the word:

I can’t really tell from most of these if “symbiome” can be a useful term or not sometimes.  Certainly the iPhylo example above has potential.  But in general, the word seems awkward at best.  Now – as far as I can tell, nobody is using it in the context of “genomics” so this does not fit in with my “badomics” obsession.  But it still does not make me feel warm and fuzzy so I am going to give it a pseudo-award – the Bad Ome-like word award.

If you are into microbial diversity and use R this may be worth checking out: phyloseq

Just got pointed (by the lead author) to this new paper: PLOS ONE: phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data


The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data.
Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.
The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

Seems similar in some ways to the WATERs Kepler Workflow that we released a few years ago.   Anyway – if you use R and are into microbial diversity studies this may be worth checking out.  As a bonus – it has a strong emphasis on reproducibility – which is a good thing.

PLoSOne paper: Parallel polymorphisms for pepper population phylogenetics, from #UCDavis, not #Pepperspray

Interesting new paper from colleagues at UC Davis: PLOS ONE: Characterization of Capsicum annuum Genetic Diversity and Population Structure Based on Parallel Polymorphism Discovery with a 30K Unigene Pepper GeneChip.

Press release is here:

Good to see something on peppers from UC Davis not about spraying.

A title and figures say it all: An In-Depth Analysis of a Piece of Shit (note – a bit gross)

I had heard about this paper a few weeks ago and was just reminded about it by a friend. To understand this paper all you really need are the title and Figure 1 and 2 which are below. Enjoy.

An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool.

Figure 1. Instruction form on how to collect whole-stool samples for the study

Figure 2. Processing of stool samples according to consistency including whole-stool homogenization.  Sausage-shaped-but-soft samples were processed like in (A) without taking samples from the center.

Important & neglected aspect of lab studies of animals : effect of habitat change on microbiome

By Aaron Logan via Wikipedia 

Very very interesting paper came out recent from some colleagues at UC Davis: PLOS ONE: Routine Habitat Change: A Source of Unrecognized Transient Alteration of Intestinal Microbiota in Laboratory Mice

Abstract: The mammalian intestine harbors a vast, complex and dynamic microbial population, which has profound effects on host nutrition, intestinal function and immune response, as well as influence on physiology outside of the alimentary tract. Imbalance in the composition of the dense colonizing bacterial population can increase susceptibility to various acute and chronic diseases. Valuable insights on the association of the microbiota with disease critically depend on investigation of mouse models. Like in humans, the microbial community in the mouse intestine is relatively stable and resilient, yet can be influenced by environmental factors. An often-overlooked variable in research is basic animal husbandry, which can potentially alter mouse physiology and experimental outcomes. This study examined the effects of common husbandry practices, including food and bedding alterations, as well as facility and cage changes, on the gut microbiota over a short time course of five days using three culture-independent techniques, quantitative PCR, terminal restriction fragment length polymorphism (TRFLP) and next generation sequencing (NGS). This study detected a substantial transient alteration in microbiota after the common practice of a short cross-campus facility transfer, but found no comparable alterations in microbiota within 5 days of switches in common laboratory food or bedding, or following an isolated cage change in mice acclimated to their housing facility. Our results highlight the importance of an acclimation period following even simple transfer of mice between campus facilities, and highlights that occult changes in microbiota should be considered when imposing husbandry variables on laboratory animals.

I personally think that we as a community are going to have to come to grips with the fact that the microbial communities in / on research organisms (of all kinds) may have a profound effect on experimental results.  This may explain many of the differences seen in experiments between facilities or over time within a facility.  In general, I think either controlling the microbes more carefully in lab experiments (e.g., using defined flora) or at least monitoring them is going to be very important to best interpret studies of plants and animals in the lab (or for that matter – in the field too).  Anyway -this paper is a tiny window into one of the ways that controlling for microbiomes may be important in lab studies.

Citation: Ma BW, Bokulich NA, Castillo PA, Kananurak A, Underwood MA, et al. (2012) Routine Habitat Change: A Source of Unrecognized Transient Alteration of Intestinal Microbiota in Laboratory Mice. PLoS ONE 7(10): e47416. doi:10.1371/journal.pone.0047416

PCR amplification and pyrosequencing of rpoB as complement to rRNA

Figure 1. Number of OTUs as
 a function of fractional sequence difference
 (OTU cut-off) for the 16S rRNA marker
 gene (A) and the rpoB marker gene (B).

Interesting new paper in PLoS One: PLoS ONE: A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity

In the paper they test and use PCR amplification and pyrosequencing of the rpoB gene for studies of the diversity of bacteria. Due to the lower level of conservation of rpoB than rRNA genes at the DNA level they focused on proteobacteria. It seems that with a little perseverance once can get PCR for protein coding genes to work reasonably well for even reasonably broad taxonomic groups (not totally new here but I am not aware of too many papers doing this with pyrosequencing). Anyway, the paper is worth a look.

 Vos M, Quince C, Pijl AS, de Hollander M, Kowalchuk GA (2012) A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity. PLoS ONE 7(2): e30600. doi:10.1371/journal.pone.0030600 Vos, M., Quince, C., Pijl, A., de Hollander, M., & Kowalchuk, G. (2012). A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity PLoS ONE, 7 (2) DOI: 10.1371/journal.pone.0030600

To be discussed in journal club here today: eukaryotic metatranscriptomics

Going to be discussing this in a journal club today

Will post some comments / notes later / as I have them but any comments from others welcome too …
PLoS ONE: Metatranscriptomics Reveals the Diversity of Genes Expressed by Eukaryotes in Forest Soils

Note – am copying the whole text here to mark it up a bit since it it awkward to try and mark it up at the PLoS One site

Metatranscriptomics Reveals the Diversity of Genes Expressed by Eukaryotes in Forest Soils

Coralie Damon1Frédéric Lehembre1,Christine Oger-Desfeux2Patricia Luis1Jacques Ranger3Laurence Fraissinet-Tachet1Roland Marmeisse1*
1 Ecologie Microbienne, UMR CNRS 5557, USC INRA 1193, Université de Lyon, Université Lyon 1, Villeurbanne, France, 2 Pôle Rhône-Alpes de Bioinformatique, Université de Lyon, Université Lyon 1, Villeurbanne, France, 3 Biogéochimie des Ecosystèmes Forestiers, INRA centre de Nancy, Champenoux, France

Abstract Top

Eukaryotic organisms play essential roles in the biology and fertility of soils. For example the micro and mesofauna contribute to the fragmentation and homogenization of plant organic matter, while its hydrolysis is primarily performed by the fungi. To get a global picture of the activities carried out by soil eukaryotes we sequenced 2×10,000 cDNAs synthesized from polyadenylated mRNA directly extracted from soils sampled in beech (Fagus sylvatica) and spruce (Picea abies) forests. Taxonomic affiliation of both cDNAs and 18S rRNA sequences showed a dominance of sequences from fungi (up to 60%) and metazoans while protists represented less than 12% of the 18S rRNA sequences. Sixty percent of cDNA sequences from beech forest soil and 52% from spruce forest soil had no homologs in the GenBank/EMBL/DDJB protein database. A Gene Ontology term was attributed to 39% and 31.5% of the spruce and beech soil sequences respectively. Altogether 2076 sequences were putative homologs to different enzyme classes participating to 129 KEGG pathways among which several were implicated in the utilisation of soil nutrients such as nitrogen (ammonium, amino acids, oligopeptides), sugars, phosphates and sulfate. Specific annotation of plant cell wall degrading enzymes identified enzymes active on major polymers (cellulose, hemicelluloses, pectin, lignin) and glycoside hydrolases represented 0.5% (beech soil)–0.8% (spruce soil) of the cDNAs. Other sequences coding enzymes active on organic matter (extracellular proteases, lipases, a phytase, P450 monooxygenases) were identified, thus underlining the biotechnological potential of eukaryotic metatranscriptomes. The phylogenetic affiliation of 12 full-length carbohydrate active enzymes showed that most of them were distantly related to sequences from known fungi. For example, a putative GH45 endocellulase was closely associated to molluscan sequences, while a GH7 cellobiohydrolase was closest to crustacean sequences, thus suggesting a potentially significant contribution of non-fungal eukaryotes in the actual hydrolysis of soil organic matter.

Introduction Top

In terrestrial ecosystems, plant litter degradation is a key ecological feature which controls not only the equilibrium between soil carbon storage and CO2 release in the atmosphere, but also the release of essential soil nutrients trapped in dead plant biomass such as organic forms of phosphorus and nitrogen.
In ecological studies, litter degradation is often estimated by measuring parameters such as soil respiration [1], litter mass loss [2] or the activities of specific microbial enzymes in soil extracts [3]. In microbiology, the degradation of plant-derived compounds such as lignocellulose has been studied using a few microbial model species and has recently led to the sequencing of the genomes of different saprotrophic fungal species which use different strategies to degrade plant material [4][7], thus revealing the full enzymatic machinery implicated in this process.
Under natural conditions, litter degradation is generally carried out by consortia of species that either act simultaneously or replace one another on a common piece of plant debris in a sometimes predictable manner and not by a single microbial species [8]. It can therefore be anticipated that the molecular machinery deployed to completely mineralize litter in the field is far more complex and diverse than the machinery observed in a single microbial genome. In addition, it is likely that the diversity of this machinery is partly controlled by litter chemistry and complexity and therefore by plant community composition. Indeed, by affecting litter quality and soil physicochemical properties, plant cover could influence microbial community composition and/or diversity [9][10], and select different microbial taxa that may possess different degradation machineries.
By allowing access to the genome contents of the different microorganisms present in a common environment (metagenomics) or to the set of genes they express (metatranscriptomics), environmental genomics offers a novel opportunity to decipher at the molecular level, complex ecological processes such as plant organic matter degradation, thus bridging the gap between global field measurements and targeted genomic approaches. These approaches have been carried out on a variety of animal digestive systems [11][17] or model composts [18]and DNA and/or RNA sequencing of their associated microbial communities has lead to the characterization of numerous members of carbohydrate-active enzyme families whose frequencies differ depending on the origin of the sequenced metagenomic DNA/RNA [11],[15].
Since the primary degradation of lignocellulosic materials in soil is believed to be carried out essentially by fungi, we present a first metatranscriptomic analysis of forest soils focussing specifically on eukaryotic organisms whose polyadenylated mRNA can be separated from prokaryotic and non coding ones, thanks to their poly-A tails [13],[19][20]. Poly-A mRNA were directly extracted from beech (Fagus sylvatica, a broadleaf deciduous angiosperm species) and spruce (Picea abies, an evergreen gymnosperm species) forest soils, converted into cDNA then cloned and 10,000 of them per library sequenced using the Sanger technology. The sequence datasets were specifically analysed with respect to the diversity of enzymes implicated in the breakdown of lignocellulosic material and downstream monosaccharide membrane transporters, as well as enzymes and transporters involved in the mobilization of the organic forms of other elements (phosphorus, nitrogen). In addition, since many microbial enzymes produced for soil nutrient acquisition, and/or toxic soil compounds inactivation (e.g. P450 monooxygenases) are also of biotechnological interest, we performed a global evaluation of the sequence dataset with respect to its biotechnological potential.

Results Top

Sequence datasets (Table 1)


Table 1. Characteristics of the sequence datasets.


cDNA libraries of ~9 105 and 5 105 plasmid clones were obtained using poly-A mRNA (NOTE – THEIR USE OF polyA FOCUSED PREP WILL DEFINITELY BIAS THIS IN FAVOR OF EUKS THOUGH WILL ALSO MISS SOME THIGNS)  extracted from beech and spruce soils, respectively. A significant proportion of cDNA inserts were between 400 and 700 bp-long, as estimated by agarose gel electrophoresis (64% and 48% of cDNA for beech and spruce respectively). Following sequencing and cleaning of the sequences (removing bad quality sequences, poly-A tails, vector and adaptor sequences as well as contaminating rRNA sequences (NOTE – NOT SURE WHY THEY REMOVED rRNA), and discarding sequences shorter than 100 nt), 7905 (beech) and 8606 (spruce) cDNAs were retained for further analyses. Contamination levels by eukaryotic and bacterial rRNA sequences represented at most 8.8% of the sequences (beech). These rRNA sequences could have been cloned because they processed A-rich sequence stretches, and therefore may not reflect the taxonomic diversity of the soil microbial communities. Finally the average read length was 430 bp for beech and 482 bp for spruce cDNAs.
For spruce and beech cDNA sequence datasets, 9% of the sequences grouped into clusters which, for more than 70% of them, contained only two sequences (Figure S1 B–C). As a result, the rarefaction curves plotting the numbers of clusters versus the numbers of sequences had the shapes of straight lines (Figure S1 A) suggesting that the datasets represented a small proportion of the total metatranscriptome sequence diversities. Only two of the 12 clusters grouping at least 8 sequences corresponded to known genes. One, from the beech library, encoded a fungal hydrophobin, a class of proteins which coat the external surface of hyphae and spores, and the other, from the spruce library, a putative chitinase.

Taxonomic diversity of the soil eukaryotic communities

For each forest soil, the taxonomic composition of the eukaryotic community contributing to the soil RNA pool was independently estimated using two sequence datasets: (i) 18S rDNA fragments PCR amplified from reverse-transcribed soil RNA (Table S1), and (WHY NOT ALSO LOOK AT THE rRNAs THEY RECOVERED IN THE cDNA LIBS?) (ii) the sequenced cDNAs. Sequences were assigned to one of eight major eukaryotic phyla, i.e. the Fungi, Metazoas, Plantae, Amoebozoas, Alveolata, Heterokonta, Rhizarias and Excavata (as defined in [21]). The last 5 phyla are collectively referred as the “protists” in the manuscript. NOT SURE THIS IS A GOOD USE OF THE TERM.
Whatever the sequence dataset and the studied soil, the Opisthokont group (i.e. essentially Fungi and Metazoas) dominated, representing 71% (beech) and 77% (spruce) of the unambiguously annotated PCR amplified 18S rDNA and 60–61% of the annotated cDNAs (Figure 1 andTable S2). Furthermore, among the Opisthokonts, fungal sequences were the most abundant, representing between 75 and 87% of the cDNAs attributed to this group.

Figure 1. Taxonomic affiliation of the different environmental sequences.
PCR amplified 18S rRNA (column 1) and cDNAs (columns 2 and 3) from spruce and beech soils were attributed to one of eight eukaryotic phyla or to the Bacteria. The “multiple affiliation” category contains sequences which could not be unambiguously placed in one of the other categories. (A), Results for all sequences which had a homolog in the GenBank/EMBL/DDJB database; (B), same analysis after removing cDNA sequences from bacteria and cDNA with uncertain “multiple affiliation” for a direct comparison between 18S rRNA and cDNA datasets. For the annotation of cDNA sequences using MEGAN, two different parameters were used; the “stringent analysis” (column 2) corresponded to the “Min Support” parameter set to default value of 5 which was set to 1 in the “less stringent” analysis (column 3).


Sequences from each “protist” group were identified at least once in each of the different beech or spruce sequence datasets and protist sequences contributed to the different datasets between 3% (cDNAs) and 12% (amplified 18S rDNA) of the unambiguously annotated sequences. It should however be stressed that the use of a single primer pair to amplify eukaryotic rDNA genes is known to capture only a fraction of the true diversity [22] and that amplification of 18S sequences from several groups of unicellular eukaryotes requires group-specific primers [23].
Using the default parameters and the BLASTX results against the NCBI-nr protein database as input file, the MEGAN software attributed a taxonomic annotation to 46% (spruce) and 37% (beech) of the cDNAs. Among these annotated cDNAs, only 2.5% were attributed to “protists” and 22–29% could not be unambiguously attributed to a single taxonomic group (multiple affiliation category, Figure 1). We asked if under-representation of specific taxonomic groups, such as the protists, and a high prevalence of sequences with multiple potential affiliations may have resulted from the settings of the MEGAN software. Indeed, using default parameters, MEGAN could take into consideration a too high number of BLAST hits (a maximum of five) to infer a taxonomic affiliation thus underestimating the abundance of taxonomic groups poorly represented in sequence databases. This hypothesis was tested by lowering the “Min Support” parameter from 5 to 1 (relaxed setting). Although this modification increased the number of sequences attributed to protists by 50% (which still represented at most 3.5% of the cDNAs); this did not change the overall taxonomic distribution of the cDNAs (Figure 1).

Global functional annotation of the cDNAs

Using an e-value threshold of 10−6, 48% (spruce) and 39.5% (beech) cDNA sequences returned a positive hit in BLASTX searches against the GenBank nr protein database. Among the sequences with no homolog in this preliminary analysis, an additional 227 (spruce) and 238 (beech) contained a registered protein motif identified using InterProScan.
Using Blast2GO, a Gene Ontology (GO) term (associated to either a biological process, a molecular function or a cellular component GO category) could be assigned to 31.5% (beech) and 39% (spruce) of the cDNAs. The five most abundant “biological process” and “molecular function” GO categories were the same for the beech and spruce datasets and altogether accounted for respectively 44% and 43% of the sequences annotated with either a “biological process” and “molecular function” GO term (Figure S2). The two most abundant categories were related to protein synthesis (ribosome biogenesis and translation) and contained, for example, ribosomal proteins encoding sequences.
Finally, 12–13% of the sequences in each datasets encoded proteins with enzymatic activities to which an Enzyme Commission (E.C.) number could be associated (Table 1). Two hundred different E.C. numbers were placed into 117 (spruce) and 112 (beech) different KEGG pathways. Several of these pathways such as glycolysis, tricarboxylic acids (TCA) cycle, nitrogen and sulfur metabolisms, amino acid biosynthesis or degradation (Table 2) are essential to appreciate basic soil processes such as global microbial activities and soil nutrient assimilation. For several of these pathways we identified key enzymes that represent the obligate entry points of major soil nutrients in fungal metabolism (Table S3). This is for example the case of the glutamine synthase (E.C. dehydrogenase (E.C. synthase (E.C. for ammonium; the adenylyl-sulfate kinase (E.C. and the sulfate adenylyltransferase (E.C. for sulfate and aldehyde reductase (E.C. for pentoses (xylose, arabinose) which are end products of hemicelluloses hydrolysis (Table S3).

Table 2. A list of KEGG pathways relevant to soil nutrient (C, N, S) utilisation and microbial metabolism for which different cDNA sequences from the spruce and beech soil could be affiliated.


Targeted annotation of the cDNAs

Our annotation effort concentrated on gene categories encoding proteins with biochemical activities directly linked to the breakdown of plant compounds and other complex nutrients (lignocellulose, chitin, proteins but also secondary metabolites) and polypeptides involved in the mobilisation of soil nutrients such as transmembrane transporters. This targeted annotation was essentially performed through sequence homology searches against specialized protein databases, which compile members of each of these different protein categories, or by using reference sequences when such specialized databases did not exist (in the case of phytases and dioxygenases). Results were further filtered manually and/or by using the MetaBioME database [24] to remove enzymes with no obvious links to the studied processes (such as for example intracellular proteases component of the proteasome).
This annotation strategy identified a large variety of enzymes involved in the above-mentioned processes including many different enzymes classes active on most plant cell wall polymers (cellulose, hemicellulose, pectin and lignin) but also on starch and chitin which, as a component of arthropods exoskeleton and fungal cell wall, represents an abundant nitrogen source in forest soils (Table 3). Altogether, Glycoside Hydrolases (GH CAZymes) represented between 0.5% (beech) and 0.8% (spruce) of all cDNAs. In addition we also identified a set of enzymes responsible for the breakdown of non-cell wall organic molecules such as different classes of proteases, a phytase, a cutinase and a putative carotenoid ester lipase (Table S4). These latter enzymes were all highly similar to known excreted fungal enzymes.

Table 3. An illustration of the diversity of potential plant cell wall and other polysaccharides (starch, chitin) degrading enzymes identified among the beech and spruce soil ESTs.


Given that organic matter input in soil also results in the input of toxic secondary metabolites, we searched for enzymes classes potentially involved in their detoxification. While we only identified a single putative fungal dioxygenase (cathechol dioxygenase) among sequences from the spruce soil, we identified a total of 26 P450 monooxygenases belonging to 12 different families (as defined in the CYPED [25]database, Table S4). Among the CYPED homologous sequences, several did indeed contribute to the breakdown of toxic compounds such as benzoate or the flavonoid phytoalexin pisatin while others were part of biosynthetic pathways such as an O-methylsterigmatocystin oxidoreductase involved in aflatoxin production.
Besides their involvement in organic matter processing, most of these different enzymes are targets for the bioindustry. We therefore performed a global annotation of the environmental EST datasets with respect to their content in genes of biotechnological interest by blasting all sequences against the MetaBioME database which compiles sequences of “Commercially Useful Enzymes” (CUE). For both spruce and beech soils, more than 3% of the sequences returned a positive hit in this analysis and the two most represented CUE classes were the oxidoreductases and the hydrolases (Figure S3). About half of these sequences were from fungi (Figure S3).
Both spruce and beech soil metatranscriptomes included significant proportions (6.2 and 4.8% respectively; E-value threshold of 10−6) of putative membrane transporters as identified by blasting all cDNA sequences against the Transporter Classification Database (TCDB;Table 4[26]. The most abundant category of fungal transporters involved in the assimilation of soil nutrients was the sugar porter one (T.C. 2.A.1.1) implicated in both hexoses (eg glucose, galactose, mannose) and pentoses (eg arabinose, xylose) assimilation, followed by the different families participating to amino acid and phosphate uptake. Concerning inorganic nutrients, besides phosphate transporters, we also identified ammonium transporters (T.C.1.A.11.) but neither nitrate (T.C.2.A.1.8) nor sulfate (T.C.2.A.53.1) ones.

Table 4. Diversity of fungal plasma membrane transporters, potentially involved in soil nutrient uptake, identified among the beech and spruce soil cDNAs.


For most of the explored gene categories we consistently identified a higher number of homologous sequences in the spruce dataset compared to the beech one (Tables 3S3 and 4). The most striking difference concerned the sugar porter family with 31 sequences in the spruce forest metatranscriptome (0.34% of the sequences) against only one in the beech forest data set (0.01%).

Identification of full-length CAZymes

Twelve presumably full-length Carbohydrate Active Enzymes (CAZyme) encoding cDNAs, belonging to families CE1 (accession no. FR865947), GH5 (FR865941), GH7 (FR865943-4), GH11 (FR865942), GH45 (FR865945), GH61 (FR865936-40) and PL1 (FR865946), were identified and sequenced entirely. The longest cDNA clone was 1671 bp-long and encoded a putative GH7 cellobiohydrolase. Except for the CE1 family member, which is thought to be an intracellular enzyme with S-formyl glutathione hydrolase activity, all other 11 predicted protein sequences possessed a N-terminal signal peptide in agreement with the potential polysaccharide degrading activities of the corresponding enzymes.
The putative taxonomic/phylogenetic origin of these 12 cDNAs was assessed using Bayesian (MrBayes) and/or maximum likelihood (PhyML) phylogenetic analyses of the deduced amino acid sequences. Protein sequences used in the phylogenetic analyses included not only the CAZyme best BLASTX hits from GenBank nr database but also homologous sequences identified among GenBank non fungal ESTs in addition to the gene models predicted from 57 recently released fungal and non fungal (animals, choanoflagellate) genome sequences (Table S5).
Each of the 12 full-length environmental sequences were homologous to fungal sequences and to sequences from other taxonomic groups either closely related to the Fungi, such as the Metazoa (GH7 and GH45) or the choanoflagellates (CE1), or to more distantly related groups such as the Plantae (GH5, PL1) or the Bacteria (GH5, GH11 and PL1). With the exception of the Carbohydrate Esterase family 1 (CE1) genes, all other studied CAZyme genes were not present in all genomes analysed and, when present, frequently occurred as gene families from which between one and three members per species were used for the phylogenetic analyses.
In many cases, sequences from one specific taxonomic group (eg Fungi or Bacteria) did not group together to form a single homogeneous and statistically well-supported clade (Figure 2Figure S4). This could be due to an insufficient number of phylogenetically-informative characters in the sequences or to a complex evolutionary history of these gene families characterised by frequent gene duplications, gene losses and potential horizontal gene transfers between distantly related taxonomic groups [27]). In fine, only the phylogeny of the CE1 family fitted to the overall phylogeny of the Fungi [28] and the CE1 environmental sequence most likely originated from an Ascomycota Pezizomycotina fungal species (Figure S4D). Several other environmental sequences, including putative pectin lyase (PL1, Figure S4A), xylanase (GH11, Figure S4B), cellobiohydrolase (GH7, Figure 2B), endocellulase (GH45, Figure 2A) and members of the copper-dependent oxidase GH61 family [29] (Figure S4E), did not group with other referenced sequences to form statistically well supported phyla. This suggests that these sequences originated from fungal or non-fungal taxonomic groups distantly related to those from which homologous sequences have already been characterised. Percentages of identical amino acid positions between the environmental sequences and their closest relatives in the phylogenetic analyses (Figures 2 andS4) supported this hypothesis for several sequences. These percentages ranged from as low as 35–38% for two GH61 and the PL1 sequence to 67–70% for a GH7 and the CE1 sequences (Table S6).

Figure 2. Putative phylogenetic origins of three environmental Glycoside Hydrolases belonging to families GH7 (A) and GH45 (B).
Bayesian (MrBayes) phylogenetic trees include protein sequences from different taxonomic groups, each identified by a specific colour. Diagrams drawn to scale illustrate the modular structure of each of the different GH45 protein sequences. Red rectangles, potential signal peptides; blue rectangles, GH45 domains used for the phylogenetic analysis; orange rectangles, family one Carbohydrate Binding Modules (CBM1) characteristic of fungal GH45; purple rectangle, CBM2 module only found in the environmental sequence. Boxed species names indicate protein sequences for which an endoglucanase catalytic activity has been experimentally established [30][33]. Posterior branch probabilities above 0.8 are given; branches with less than 0.5 probability support were collapsed.


Two of the environmental CAZymes potentially originated from the soil fauna. This was the case for one of the two analysed GH7 (putative cellobiohydrolase Env-GH7/16YJ11; Figure 2A) family members that grouped with the recently identified crustacean cellobiohydrolase polypeptides. This was also the case for the GH45 family member encoding a putative endocellulase (Figure 2B). This latter environmental sequence grouped with Molluscan GH45 sequences but differed from these sequences by the presence of a family two carbohydrate binding module (CBM2) at its C-terminal end (Figure 2B). Carbohydrate binding modules existed in several fungal GH45 proteins but they all belonged to the unrelated CBM1 family.

Discussion Top

This report on the systematic sequencing of environmental eukaryotic cDNAs establishes the main features of this experimental approach. First, the specific analysis of polyadenylated mRNA allows access to the protein-coding gene pool of eukaryotes and efficiently counter-selects not only rRNA but also “non-eukaryotic” mRNA. This is in contrast to the studies published so far on the analysis of “global, essentially prokaryotic, metatranscriptomes” [34][38] which result in sequence datasets comprising as little as less than 1% [35][36] to often more than 50% of rRNA sequences despite the use of different protocols to eliminate these molecules [39][40]. Furthermore, with less than 3% of cDNAs attributed to bacteria in both beech and spruce soil metatranscriptomes, polyadenylation of bacterial mRNA [41] does not seem to represent a challenge for the study of eukaryotic environmental mRNA. Moreover, the exact taxonomic origin of cDNAs attributed to bacteria should to be carefully evaluated as they may include genuine eukaryotic genes not yet identified in this taxonomic group such as bacterial genes recently acquired by eukaryotes by horizontal gene transfer; as for example recently suggested for nematodes carbohydrate active enzymes (CAZymes; [27]).
Secondly, among the cDNAs, between 52 and 60% have no homologs following BLASTX searches against the GenBank nr database. These figures, similar to the one reported for the Muskoxen rumen eukaryotic metatranscriptome [13], are higher than the percentages of orphan genes revealed in recently published genomic sequences from fungi and invertebrate phyla, that dominate the two studied soil eukaryotic communities. Figures vary widely from as low as roughly 5% for the entomopathogenic Ascomycota fungi Metarhizum sp. [42] to 20% for the symbiotic species Tuber melanosporum [43]. In the case of the insect Pediculus humanus [44] and the crustacean Daphnia pulex [45], two arthropods distantly related to other fully sequenced species from this phylum, the percentages of genes without homologs have been estimated to be of 10 and 36% respectively. Factors susceptible to accentuate the low percentage of cDNA taxonomic affiliation in eukaryotic metatranscriptomes include the presence of poorly studied protist groups in soils and also the large proportion of short sequences located at the 3′ end of the cDNAs which comprise untranslated 3′ UTRs. Future use of pyrosequencing or of other high-throughput sequencing technologies may increase the percentages of short reads, a factor that will be mitigated by a sharp increase in the number of sequences produced. All these figures must however be considered as transitory. Indeed, the current exponential increase in the number of available genome sequences in conjunction with a broader taxonomic sampling of the sequenced taxa, more representative of the soil biota, should have a direct and rapid positive impact on the taxonomic annotation of eukaryotic environmental sequences. More specifically it will allow us to carefully evaluate the contribution of protist taxa to the soil metatranscriptomes. Regarding functional annotation, better assignation will depend upon both (i) functional studies on individual unknown environmental sequences [46] and (ii) functional validation of unknown proteins from model organisms.
Despite the large proportion of housekeeping genes involved in basic cell maintenance mechanisms, both global (KEGG pathways) and targeted annotation identified a large variety of genes of interest in a context of soil ecosystem functioning and more specifically in the turnover of plant biomass and soil nutrient cycling and utilization. These nutrients encompass inorganic (ammonium, phosphates) and organic (aminoacids, peptides, phytate) forms of nitrogen and phosphorus, sulfates and simple sugars resulting essentially from either rhizodeposition or lignocellulose breakdown. We also noticed that genes representative of other key pathways such as the nitrate assimilation one (which comprises specific transporters, nitrate and nitrite reductases) were however missing in the datasets which could indicate that the corresponding nutrient represents a minor N source in the studied acidic forest soils.
Enzymes participating to plant cell wall deconstruction are of special interest not only from an ecological point of view to quantify and understand soil C turnover, but also because they are among the most widely used enzymes in the industry and are instrumental to the development of second generation biofuels. We identified among the beech and/or spruce soil cDNAs representatives of enzymes active on the main plant cell wall polymers: lignins, cellulose, hemicelluloses and pectins. Genes encoding carbohydrate active enzymes clearly outnumbered those encoding enzymes directly active on lignin although members of the three lignin oxidase families were identified in at least one of the two metatranscriptomes (Table 3). Glycoside Hydrolases (GH) encoding genes represented 0.46% (beech) and 0.78% (spruce) of the analyzed cDNAs. These figures are very similar to those reported for different “carbohydrate-adapted” metagenomes [18]. In a perspective of mining soil eukaryotic metatranscriptomes for such enzymes these values signify that the screening effort could be similar to the screening effort developed for metagenomes [14]. Additional sequencing effort as well as the study of replicate soil samples are nevertheless required before concluding on a potential effect of litter quality (beech versus spruce) on the relative abundance of transcripts related to plant cell wall degradation or nutrient assimilation (eg sugar transporters). It is indeed known that these two species differ with respect to the nature and proportions of polymers present in their cell walls [47][48] and that spruce litter mineralizes more quickly compared to the beech one [49].
The use of Sanger sequencing of cloned cDNAs allowed us to easily identify several full-length genes among which several CAZyme ones. Phylogenetic analysis of the coded protein sequences suggests that many of them are distantly related to protein sequences present in databases. This further emphasizes the interest of eukaryotic metatranscriptomes as a source of novel enzymes with potentially interesting catalytic properties. This also indicates that organisms studied thus far at the molecular level in a context of plant cell wall deconstruction may not be representative of the organisms at work in the soil ecosystem. It is particularly interesting to consider that at least two out of the 12 full-length CAZymes (putative cellobiohydrolase and endocellulase) could originate not from fungi but from animals. A number of recent publications have indeed reported the existence of plant cell wall degrading enzyme genes in the genome of different groups of invertebrates which are represented among the soil fauna[27][31][33][45][50][51]. This could indicate that in addition to their well recognized role in the fragmentation and mixing of plant litter, the soil fauna could also play a direct and potentially major role in the actual hydrolysis of plant carbohydrates in soils which adds up to hydrolysis by their gut microbiomes.
In conclusion, this first report of a eukaryote-specific analysis of soil metatranscriptomes highlights the potential of this experimental approach in two research fields. In the field of environmental biotechnology, eukaryotic metatranscriptomes represent diversified multigenome resources for many different gene categories used in the bioindustry. In this context, construction of environmental cDNA libraries whose potentially large inserts can be sequenced (this study) or directly expressed in heterologous hosts [20][46][52][53]potentially represents the best experimental approach. In the field of environmental sciences, the study of eukaryotic metatranscriptomes will help us to reevaluate the contribution of the different soil eukaryotes to basic and essential soil processes such as organic matter degradation. It will also contribute to compare ecosystems which differ with respect to i.e. vegetation type, soil characteristics and climate and evaluate the relative contribution of these different variables on different soil processes. In this respect, high-throughput sequencing technologies should be implemented.

Materials and Methods Top

Study site and soil sampling

Soil samples were taken from an environmental research observatory forest site located in central France (Breuil-Chenue forest, 47°18′10″N, 4°4′44″E, 638 m above sea level). This site, initially covered by a mixed broadleaved tree forest (Fagus sylvaticaQuercus sessiliflora,Betula verrucosaCorylus avelana) was clear-cut and replanted in 1976 by separate, single forest tree species stands, among which spruce (Picea abies) and beech (Fagus sylvatica). On July the 10th 2007, 14 (spruce stand) and 16 (beech stand) soil samples (8 cm in diameter, 15 cm in depth) were collected along a systematic sampling grid. For each stand, the uppermost 3–7 cm thick organic matter-rich horizon of each core was sieved (2 mm mesh size) and mixed together (100 ml per core) to obtain a composite sample which was immediately frozen at −70 °C. Pedoclimatic parameters, soil characteristics and sampling strategy are detailed in Table S7.

RNA extraction, cDNA libraries construction and sequencing

Total RNA (~100 µg) was extracted from ~90 g of each of the two composite forest soil samples as described in [20] and [54]. Polyadenylated mRNA were separated from non-polyadenylated RNAs (rRNAs, mitochondrial and prokaryotic mRNAs, tRNAs) by affinity capture on paramagnetic beads coated with poly-dT (Dynabeads Oligo (dT) kit, Dynal). Unbound non-polyadenylated RNAs were recovered by ethanol precipitation. Reverse transcription was carried out following the SMART cDNA Library Construction Kit instructions (Clontech), using between 90 (beech) and 210 ng (spruce) of purified mRNA. cDNA were further amplified by long distance PCR (LD-PCR) as described in the SMART cDNA Library Construction Kit, by using 20 and 24 PCR cycles for spruce and beech respectively. After size-fractionation to remove cDNA smaller than 400 pb (CHROMA SPIN-400 Columns, Clontech), the cDNAs were digested with Sfi1 and ligated into the Sfi1-digested pDNR-LIB plasmid vector (Clontech).
Each cDNA library was introduced in electro-competent E. coli cells (DH10B, Invitrogen) and 10,000 randomly selected cDNA clones were sequenced from their 5′ end using the M13-20 primer and Sanger chemistry (Genoscope sequencing centre, Ivry, France).

18S rDNA gene libraries construction and sequencing

350 (beech) and 730 (spruce) ng of soil non-polyadenylated RNA were reverse transcribed using 0.2 µg of random hexamers and 200 U of M-MuLV Reverse Transcriptase according to the manufacturer instructions (MBI Fermentas). A ca 560 bp-long fragment located at the 5′-end of the eukaryotic 18S rDNA gene was amplified by PCR using primers Euk1A (CTGGTTGATCCTGCCAG) and Euk516R (ACCAGACTTGCCCTCC) described by [55]. PCR mixtures (25 µl) contained 200 nM of each primer, 200 µM of each dNTP, 1 mM MgCl2, 0.25−1 of bovine serum albumin, 0.625 U of Pfu DNA polymerase, the appropriate buffer (Fermentas) and one tenth of the reverse-transcription mix. Amplification reaction started with an initial denaturation step of 3 min at 96°C, followed by 25 cycles comprising 45 sec at 96°C, 45 sec at 56°C and 2 min at 72°C, and finished with an elongation step of 5 min at 72°C. Amplification products of the expected size from 15 PCR tubes were pooled, isolated from an agarose gel (Nucleospin Extract kit, Macherey-Nagel) and ligated in the plasmid pCR-Blunt II-TOPO (Zero Blunt TOPO PCR Cloning kit, Invitrogen) that was used to transform electro-competent DH10B E. coli cells (Invitrogen).
For each rDNA library, the ca 560 bp rDNA inserts of 96 clones were entirely sequenced (Agowa Company, Berlin, Germany) using universal primer M13-20. Sequences were manually corrected and edited. BLASTn searches were performed against GenBank nr nucleotide database at NCBI ( using default parameters except for word size which was set to 7. Sequences were analyzed with the rRNA Database Project CHECK_CHIMERA program ( Potential chimeras were further analysed by blasting separately the two dissimilar segments of the sequences against GenBank. Confirmed chimeras and artefacts were eliminated from the sequence datasets. All remaining sequences were submitted to EMBL and are available under accession numbers FN393180–FN393221 (beech) and FN393323–FN393380 (spruce).

cDNA sequences cleaning and clustering

All cDNA sequences were trimmed using TIGR SEQCLEAN tool ( with default parameters to eliminate poly-A sequences at the 3′end of cDNA, vector, adaptor and primer sequences and undetermined nucleotides. Sequences shorter than 100 nucleotides were eliminated. Contaminant ribosomal RNA sequences were identified by BLASTN analysis (cut-off threshold of E-value≤10−10, word-size = 11) of all sequences against LSUrdb and SSUrdb_SSURef100 libraries described in [38] (available at​/community-profiling/), as well as against NCBI-nr nucleotide database (looking for “ribosomal RNA” or “rRNA” in the title). The resulting “cleaned” sequence dataset was submitted to EMBL and is available under accession numbers FR706059–714330 (beech) and FR697056–706058 (spruce).
Cleaned sequences were clustered using Cd-hit ([56]) using a 90% identity threshold. The resulting clusters were used to perform a rarefaction analysis using S. Holland’s Analytical Rarefaction version 1.3 software (

Global cDNA sequences annotation

Cleaned cDNA sequences were queried (BLASTX) against NCBI nr protein database (, using 2 different strategies, (i) Q-BLAST via BLAST2GO [57] with parameters, E-value≤10−6 and overlapping length >33% to the corresponding best hit and, (ii) NetBlast 2.2.22 (E-value≤10−6). As both analyses gave similar percentages of annotated cDNA sequences, we only considered the Q-BLAST results. In addition to raw BLASTX analyses, we also used BLAST2GO to perform Gene Ontology (GO, [58]), Enzyme code (E.C.) and InterPro (conserved patterns in sequences) annotations.

Targeted annotation of cDNAs

Cleaned sequence datasets were searched for sequences similar to genes coding for organic matter degradation enzymes and for enzymes of biotechnological interest, using BLASTX searches (default parameters) against various specialized databases. Searches were made against Carbohydrate Active Enzymes (CAZy,[59]); and Fungal Oxidative Lignin Enzymes (FOLy;[60]) databases to find plant cell wall degradation enzymes. cDNA similar to cytochrome P450 potentially involved in detoxication of plant secondary metabolites were searched against CYPED database (downloaded on October 2009,[61]). Lipases/esterases and proteases were searched against the Lipases Engineering database (LED, downloaded on December 2009;[62]) and MEROPS (downloaded on January 2010,[63]) respectively. Membrane transporters (for sugar, amino acids, oligopeptides and phosphate) were searched against the Transporter Classification Database (downloaded on April 2010,[26]). For other enzymes (phytases and dioxygenases), for which no specific databases exist, reference protein sequence were extracted from GenBank/EMBL/DDJB database and used as queries to perform a tBLASTn analysis (default parameters) against our cDNA libraries.
For all these specific analyses, the functional annotations obtained were further confirmed using a BLASTX analysis (default parameters) of the corresponding cDNAs against GenBank nr and/or Swissprot protein databases. We only retained cDNA sequences similar to the fungal targeted enzyme using a E-value threshold of 0.001. For cytochrome P450, lipases, proteases, phytases and dioxygenases, identified cDNAs were further filtered by BLASTX (default parameter, E-value≤10−6) against a curated database of commercially useful enzymes (MetaBioME,​dex.php[24]) to retain only sequences similar to enzymes, which have known applications in industries. To identify full-length sequences, annotated cDNAs were translated and aligned with the closest protein sequences identified in Genbank/EMBL/DDJB to search for a putative start codon in the N-terminal region.

Taxonomic annotation of rRNA and cDNA sequences

For PCR-amplified 18S ribosomal sequences, BLASTN searches (default parameters, except word size set at 7) were performed against the GenBank nr nucleotide database at NCBI ( 18S sequences were then attributed to major eukaryotic phyla (listed in Figure 1 and Table S2) by performing phylogenetic analyses (BioNJ and PhyM) using for the sequence alignments the best Blast hits and a set of reference sequences for each of the different phyla. Analyses (sequence alignments and phylogenetic analyses) were performed using Seaview [64].
Using BLASTX output files, cDNAs were taxonomically assigned using MEGAN V 3.5 (Metagenome analyser,​are/megan[65]). This program uses the result of a BLAST comparison and assigns each read to a taxon at a specific taxonomic level. All parameters of MEGAN were kept at default values. For the cDNA taxonomic assignation, an additional analysis was performed by setting the “min support” option (the minimum number of sequence reads that must be assigned to a taxon) to one instead of 5 used as default parameter.

Phylogenetic analyses

Protein sequences used for phylogenetic reconstructions were selected among BLAST results against different databases. Each of the environmental protein sequences were blasted against the GenBank/EMBL/DDJB nr protein database (BLASTX) and non human EST one (TBLASTN). Additional homology searches (BLASTX) were performed against the proteomes of fully sequenced organisms available through different websites (Table S5). Depending on the number and taxonomic diversity of the best BLAST hits, between one/two (individual genome sequences) and five (GenBank/EMBL/DDJB) protein sequences were selected for the sequence alignments.
Alignments were performed on the platform ([66]) using MUSCLE (default parameters) followed by the selection of phylogenetically informative regions using GBlocks (low stringency parameters). The resulting cured alignment were used for phylogenetic analyses using MrBayes (less than 30 sequences) and/or PhyML. Parameters for Bayesian analyses were set as followed: number of substitution type, 6 (GTR); substitution model, WAG; rates variation across sites, invariable plus Gamma; analyses were run for 100,000 generations, sampling tree every 10 generations, burning of 100. PhyML was run on the Seaview platform [64] using default parameters except for the WAG substitution model.

Supporting Information Top

Figure S1.

Clustering of the cDNA datasets. (A) Rarefaction curves plotting the no. of cDNA sequences against the no. of clusters showing that most sequences are unique; (B) and (C), size distribution of the clusters showing that few of them contains more than 4 sequences.

Figure S2.

The five most represented Gene Ontology (GO) categories are the same for the spruce and beech datasets.

Figure S3.

Global biotechnological potential of the cDNA datasets.Distribution of cDNA sequences homologous to “Commercially Useful Enzymes” (CUEs) in the MetaBioME database according to enzyme activity (E.C. no.). Analysis was performed separately for all cDNAs and for those affiliated to the fungi (see Fig. 1).

Figure S4.

Putative phylogenetic origins of nine full-length environmental CAZymes. Environmental sequences (in orange) belong to families PL1 (A), GH11 (B), GH5 (C), CE1 (D) and GH61 (E). Maximun likelihood (PhyML) phylogenetic trees include protein sequences from different taxonomic groups, each identified by a specific colour; red, Fungi Basidiomycota; pink, Fungi Ascomycota; orange, other Fungi; Black, Choanoflagellida; brown, Bacteria; green, Plantae. Correspondence between numbers and species names is given in Table S5. Black diamonds point to the sequences used for the calculation of the percentages of amino acid identity and similarity with environmental sequences (Table S6).

Table S1.

Characteristics of the PCR-amplified 18S rRNA sequence datasets.

Table S2.

Taxonomic affiliation of the 18S rRNA and cDNA sequence datasets.

Table S3.

An illustration of some of the key enzymes identified in major KEGG metabolic pathways relevant to either C, N or S metabolism.

Table S4.

An illustration of the diversity of potential organic matter degrading enzymes, other than plant cell wall and polysaccharide active ones.

Table S5.

Origin of the CAZyme sequences used in the phylogenetic analyses.

Table S6.

Percentages of conserved, identical and similar, amino acid positions between the 12 full length environmental CAZyme proteins and one or two of their closest phylogenetically related neighbours.

Table S7.

Stand and sampling characteristics.

Acknowledgments Top

Global annotation of the cDNA sequences was performed at the PRABI Bioinformatics platform of the university Lyon 1 under supervision of Christian Gautier. We acknowledge Pedro Coutinho (AFMB, Marseille) and Eric Record (Biotechnologie des Champignons Filamenteux, Marseille) for the search of CAZyme and FOLyme genes respectively and the Joint Genome and Broad Institutes for making available annotated eukaryotic genomes prior to publication.

Author Contributions Top

Conceived and designed the experiments: CD LF-T RM. Performed the experiments: CD. Analyzed the data: CD FL PL CO-D RM. Contributed reagents/materials/analysis tools: JR CO-D. Wrote the paper: RM CD FL PL LF-T CO-D.

References Top

  1. Meier C, Bowman WD (2008) Links between plant litter chemistry, species diversity, and below-ground ecosystem function. Proc Natl Acad Sci U S A 105: 19780–19785. FIND THIS ARTICLE ONLINE
  2. Cornwell WK, Cornelissen JH, Amatangelo K, Dorrepaal E, Eviner VT, et al. (2008) Plant species traits are the predominant control on litter decomposition rates within biomes worldwide. Ecol Lett 11: 1065–1071. FIND THIS ARTICLE ONLINE
  3. Sinsabaugh RL, Lauber CL, Weintraub MN, Ahmed B, Allison SD, et al. (2008) Stoichiometry of soil enzyme activity at global scale. Ecol Lett 11: 1252–1264. FIND THIS ARTICLE ONLINE
  4. Martinez D, Larrondo LF, Putnam N, Gelpke MD, Huang K (2004) Genome sequence of the lignocellulose degrading fungusPhanerochaete chrysosporium strain RP78. Nat Biotechnol 22: 695–700. FIND THIS ARTICLE ONLINE
  5. Martinez D, Berka RM, Henrissat B, Saloheimo M, Arvas M, et al.(2008) Genome sequencing and analysis of the biomass-degrading fungus Trichoderma reesei (syn. Hypocrea jecorina). Nat Biotechnol 26: 553–560. FIND THIS ARTICLE ONLINE
  6. Martinez D, Challacombe J, Morgenstern I, Hibbett D, Schmoll M, et al. (2009) Genome, transcriptome, and secretome analysis of wood decay fungus Postia placenta supports unique mechanisms of lignocellulose conversion. Proc Natl Acad Sci U S A 106: 1954–1959.FIND THIS ARTICLE ONLINE
  7. Jeffries TW, Grigoriev IV, Grimwood J, Laplaza JM, Aerts A, et al.(2007) Genome sequence of the lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nat Biotechnol 25: 319–26.FIND THIS ARTICLE ONLINE
  8. Frankland JC (1998) Fungal succession – unravelling the unpredictable. Mycol Res 102: 1–15. FIND THIS ARTICLE ONLINE
  9. Nielsen UN, Osler GHR, Campbell CD, Burslem DFRP, van der Wal R(2010) The influence of vegetation type, soil properties and precipitation on the composition of soil mite and microbial communities at the landscape scale. J Biogeogr 37: 1317–1328.FIND THIS ARTICLE ONLINE
  10. Zinger L, Lejon DPH, Baptist F, Bouasria A, Aubert S, et al. (2011) Contrasting diversity patterns of crenarchaeal, bacterial and fungal soil communities in an alpine landscape. PLoS ONE 6: e19950. FIND THIS ARTICLE ONLINE
  11. Brulc JM, Antonopoulos DA, Berg Miller ME, Wilson MK, Yannarell AC(2009) Gene-centric metagenomics of the fiber-adherent bovine rumen microbiome reveals forage specific glycoside hydrolases. Proc Natl Acad Sci U S A 106: 1948–1953. FIND THIS ARTICLE ONLINE
  12. Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, et al. (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331: 463–7. FIND THIS ARTICLE ONLINE
  13. Qi M, Wang P, O’Toole N, Barboza PS, Ungerfeld E, et al. (2011) Snapshot of the Eukaryotic gene expression in Muskoxen rumen – A metatranscriptomic approach. PLoS ONE 6: e20521. FIND THIS ARTICLE ONLINE
  14. Tasse L, Bercovici J, Pizzut-Serin S, Robe P, Tap J, et al. (2010) Functional metagenomics to mine human gut microbiome for dietary fiber catabolic enzymes. Genome Res 20: 1605–1612. FIND THIS ARTICLE ONLINE
  15. Pope PB, Denman SE, Jones M, Tringe SG, Barry K, et al. (2009) Adaptation to herbivory by the Tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proc Natl Acad Sci U S A 107: 14793–798. FIND THIS ARTICLE ONLINE
  16. Tartar A, Wheeler M, Zhou X, Coy MR, Boucias DG, et al. (2009) Parallel metatranscriptome analyses of host and symbiont gene expression in the gut of the termite Reticulotermes flavipes. Biotechnol Biofuels 2: 25. FIND THIS ARTICLE ONLINE
  17. Warnecke F, Luginbühl P, Ivanova N, Ghassemian M, Richardson TH, et al. (2007) Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature 450: 560–5.FIND THIS ARTICLE ONLINE
  18. Allgaier M, Reddy A, Park JI, Ivanova N, D’haeseleer P, et al.(2010) Targeted discovery of glycoside hydrolases from a switchgrass-adapted compost community. PLoS ONE 5: e8812.FIND THIS ARTICLE ONLINE
  19. Grant S, Grant WD, Cowan DA, Jones BE, Ma Y, et al. (2006) Identification of eukaryotic open reading frames in metagenomic cDNA libraries made from environmental samples. Appl Environ Microbiol 72: 135–143. FIND THIS ARTICLE ONLINE
  20. Bailly J, Fraissinet-Tachet L, Verner M-C, Debaud J-C, Lemaire M, et al. (2007) Soil eukaryotic functional diversity, a metatranscriptomic approach. ISME J 1: 632–642. FIND THIS ARTICLE ONLINE
  21. López-García P, Moreira D (2008) Tracking microbial biodiversity through molecular and genomic ecology. Res Microbiol 159: 67–73.FIND THIS ARTICLE ONLINE
  22. Jeon S, Bunge J, Leslin C, Stoeck T, Sunhee H, et al. (2008) Environmental rRNA inventories miss over half of protistan diversity. BMC Microbiol 8: 222. FIND THIS ARTICLE ONLINE
  23. Stephenson SL, Fiore-Donno AM, Schnittler M (2011) Myxomycetes in soil. Soil Biol Biochem 43: 2237–2242. FIND THIS ARTICLE ONLINE
  24. Sharma VK, Kumar N, Prakash T, Taylor TD (2010) MetaBioME: a database to explore commercially useful enzymes in metagenomics datasets. Nucl Ac Res 38: D468–D472. FIND THIS ARTICLE ONLINE
  25. Fischer M, Knoll M, Sirim D, Wagner F, Funke S, et al. (2007) The cytochrome P450 engineering database: a navigation and prediction tool for the cytochrome P450 protein family. Bioinformatics 23: 2015–2017. FIND THIS ARTICLE ONLINE
  26. Saier MH Jr, Tran CV, Barabote RD (2006) TCDB: the transporter Classification Database for membrane transport protein analyses and information. Nucl Acids Res 34: D181–D186. FIND THIS ARTICLE ONLINE
  27. Danchin EG, Rosso MN, Vieira P, de Almeida-Engler J, Coutinho PM, et al. (2010) Multiple lateral gene transfers and duplications have promoted plant parasitism ability in nematodes. Proc Natl Acad Sci U S A 107: 17651–17656. FIND THIS ARTICLE ONLINE
  28. James TY, Kauff F, Schoch CL, Matheny PB, Hofstetter V, et al.(2006) Reconstructing the early evolution of Fungi using a six-gene phylogeny. Nature 443: 818–22. FIND THIS ARTICLE ONLINE
  29. Quinlan RJ, Sweeney MD, Leggio LL, Otten H, Poulsen J-CN, et al.(2011) Insight into oxidative degradation of cellulose by a copper metalloenzyme that exploits biomass components. Proc Natl Acad Sci U S A 108: 15079–15084. FIND THIS ARTICLE ONLINE
  30. Liu G, Wei X, Qin Y, Qu Y (2010) Characterization of the endoglucanase and glucomannanase activities of a glycoside hydrolase family 45 protein from Penicillium decumbens 114-2. J Gen Appl Microbiol 56: 223–229. FIND THIS ARTICLE ONLINE
  31. Sakamoto K, Toyohara H (2009) Molecular cloning of glycoside hydrolase family 45 cellulase genes from brackish water clamCorbicula japonica. Comp Biochem Physiol B Biochem Mol Biol 152: 390–396. FIND THIS ARTICLE ONLINE
  32. Saloheimo A, Henrissat B, Hoffrén AM, Penttilä M (1994) A novel, small endoglucanase gene, egl5, from Trichoderma reesei isolated by expression in yeast. Mol Microbiol 13: 219–228. FIND THIS ARTICLE ONLINE
  33. Xu B, Hellman U, Ersson B, Janson J-C (2000) Purification, characterization and amino-acid sequence analysis of a thermostable, low molecular mass endo-ß-1,4-glucanase from blue mussel, Mytilus edulis. Eur J Biochem 267: 4970–4977. FIND THIS ARTICLE ONLINE
  34. Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al.(2008) Microbial community gene expression in ocean surface waters. Proc Natl Acad Sci U S A 105: 3805–3810. FIND THIS ARTICLE ONLINE
  35. Gilbert J, Field D, Huang Y, Edwards R, Li W, et al. (2008) Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE 3: e3042. FIND THIS ARTICLE ONLINE
  36. Gilbert JA, Field D, Swift P, Thomas S, Cummings D, et al. (2010) The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS ONE 5: e15545. FIND THIS ARTICLE ONLINE
  37. Shrestha PM, Kube M, Reinhardt R, Liesack W (2009) Transcriptional activity of paddy soil bacterial communities. Environ Microbiol 11: 960–970. FIND THIS ARTICLE ONLINE
  38. Urich T, Lanzen A, Qi J, Huson DH, Schleper C, et al. (2008) Simultaneous assessment of soil microbial community structure and function through analysis of the metatranscriptome. PLoS ONE 3: e2527. FIND THIS ARTICLE ONLINE
  39. He S, Wurtzel O, Singh K, Froula JL, Yilmaz S, et al. (2010) Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nature Met 10: 807–812. FIND THIS ARTICLE ONLINE
  40. Stewart FJ, Ottesen EA, DeLong EF (2010) Development and quantitative analyses of a universal rRNA-substraction protocol for microbial metatranscriptomics. ISME J 4: 896–907. FIND THIS ARTICLE ONLINE
  41. Sharkar N (1997) Polyadenylation of mRNA in prokaryotes. Ann Rev Biochem 66: 173–197. FIND THIS ARTICLE ONLINE
  42. Gao Q, Jin K, Ying S-H, Zhang Y, Xiao G, et al. (2011) Genome sequencing and comparative transcriptomics of the model entomopathogenic fungi Metarhizium anisopliae and M. acridum. PLoS Genet 7: e1001264. FIND THIS ARTICLE ONLINE
  43. Martin F, Kohler A, Murat C, Balestrini R, Coutinho PM, et al. (2010) Périgord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis. Nature 464: 1033–1038. FIND THIS ARTICLE ONLINE
  44. Kirkness EF, Haas BJ, Sun W, Braig HR, Perotti MA, et al. (2010) Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. Proc Natl Acad Sci U S A 107: 12168–12173. FIND THIS ARTICLE ONLINE
  45. Colbourne JK, Pfrender ME, Gilbert D, Thomas WK, Tucker A, et al.(2011) The ecoresponsive genome of Daphnia pulex. Science 331: 555–561. FIND THIS ARTICLE ONLINE
  46. Damon C, Vallon V, Zimmermann S, Haider MZ, Galeote V, et al.(2011) A novel fungal family of oligopeptide transporters identified by functional metatranscriptomics of soil eukaryotes. ISME J. doi:10.1038/ismej.2011.67.
  47. Sarkar P, Bosneaga E, Auer M (2009) Plant cell walls throughout evolution: towards a molecular understanding of their design principles. J Exp Bot 60: 3615–3635. FIND THIS ARTICLE ONLINE
  48. Schädel C, Blöchl A, Richter A, Hoch G (2010) Quantification of monosaccharide composition of hemicelluloses from different plant functional types. Plant Physiol Biochem 48: 1–8. FIND THIS ARTICLE ONLINE
  49. Moukoumi J, Munier-Lamy C, Berthelin J, Ranger J (2006) Effect of tree species substitution on organic matter biodegradability and mineral nutrient availability in a temperate topsoil. Ann For Sci 63: 763–771. FIND THIS ARTICLE ONLINE
  50. King AJ, Cragg SM, Li Y, Dymond J, Guille MJ, et al. (2010) Molecular insight into lignocellulose digestion by a marine isopod in the absence of gut microbes. Proc Natl Acad Sci U S A 107: 5345–5350. FIND THIS ARTICLE ONLINE
  51. Pauchet Y, Wilkinson P, Chauhan R, Ffrench-Constant RH (2010) Diversity of beetle genes encoding novel plant cell wall degrading enzymes. PLoS ONE 5: e15635. FIND THIS ARTICLE ONLINE
  52. Kellner H, Luis P, Portetelle D, Vandenbol M (2010) Screening of a soil metatranscriptomic library by functional complementation ofSaccharomyces cerevisiae mutants. Microbiol Res 166: 360–368.FIND THIS ARTICLE ONLINE
  53. Findley SD, Mormile MR, Sommer-Hurley A, Zhang X-C, Tipton P, et al. (2011) Activity-based metagenomic screening and biochemical characterization of bovine ruminal protozoan glycoside hydrolases. Appl Environ Microbiol. doi:10.1128/AEM.05925-11.
  54. Damon C, Barroso G, Férandon C, Ranger J, Fraissinet-Tachet L, et al. (2010) Performance of the COX1 gene as a marker for the study of metabolically active Pezizomycotina and Agaricomycetes fungal communities from the analysis of soil RNA. FEMS Microbiol Ecol 74: 693–705. FIND THIS ARTICLE ONLINE
  55. Diez B, Pedros-Alio C, Massana R (2001) Study of genetic diversity of eukaryotic picoplankton in different oceanic regions by small-subunit rRNA gene cloning and sequencing. Appl Environ Microbiol 67: 2932–2941. FIND THIS ARTICLE ONLINE
  56. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659. FIND THIS ARTICLE ONLINE
  57. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, et al. (2005) Blast2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–367. FIND THIS ARTICLE ONLINE
  58. The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat Genet 25: 25–29. FIND THIS ARTICLE ONLINE
  59. Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, et al.(2009) The Carbohydrate-Active EnZyme database (CAZy): an expert resource for glycogenomics. Nucl Ac Res 37: D233–D238.FIND THIS ARTICLE ONLINE
  60. Levasseur A, Piumi F, Coutinho PM, Rancurel C, Asther M, et al.(2008) FOLy: an integrated database for the classification and functional annotation of fungal oxidoreductases potentially involved in the degradation of lignin and related aromatic compounds. Fung Genet Biol 45: 638–645. FIND THIS ARTICLE ONLINE
  61. Sirim D, Wagner F, Lisitsa A, Pleiss J (2009) The cytochrome P450 Engineering Database: integration of biochemical properties. BMC Biochem 10: 27. FIND THIS ARTICLE ONLINE
  62. Fischer M, Pleiss J (2003) The lipase engineering database: a navigation and analysis tool for protein families. Nucl Ac Res 31: 319–321. FIND THIS ARTICLE ONLINE
  63. Rawlings ND, Barrett AJ, Bateman A (2009) MEROPS: the peptidase database. Nucl Ac Res 38: D227–D233. FIND THIS ARTICLE ONLINE
  64. Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27: 221–224. FIND THIS ARTICLE ONLINE
  65. Huson DH, Richter DC, Mitra S, Auch AF, Schuster SC (2009) Methods for comparative metagenomics. BMC Bioinformatics 10: Suppl 1S12. FIND THIS ARTICLE ONLINE
  66. Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, et al. (2008) a robust phylogenetic analysis for the non-specialist. Nucl Acids Res 36: W465–W469. FIND THIS ARTICLE ONLINE
  67. Coutinho PM, Andersen MR, Kolenova K, vanKuyk PA, Benoit I, et al. (2009) Post-genomic insight into the plant polysaccharide degradation potential of Aspergillus nidulans and comparison toAspergillus niger and Aspergillus oryzae. Fung Genet Biol 46: S161–S169. FIND THIS ARTICLE ONLINE