Crosspost from http://microBE.net: New, massive volumes on #metagenomics coming out soon

For those interested in microbial diversity and/or metagenomics, two volumes of interest are coming out soon:

Edited by Frans J. de Bruijn, these two volumes are the most comprehensive coverage of metagenomics out there right now. The chapters are almost overwhelming (full disclosure: I have two chapters in here, both of which are republications of Open Access papers I have published on metagenomics). See below for full chapter lists.

Order from Amazon:

Volume I: Metagenomics and Complementary Approaches

  • 1. Introduction (Frans J. de Bruijn).
  • Background Chapters.
    • 2. DNA reassociation yields broad-scale information on metagenome complexity and microbial diversity (V. Torsvik).
    • 3. Diversity of 23S rRNA genes within individual prokaryotic genomes (Zhiheng Pei).
    • 4. Use of the rRNA operon and genomic repetitive sequences for the identification of bacteria (A. Nascimento).
    • 5. Use of different PCR primer-based strategies for characterization of natural microbial communities (James Prosser).
    • 6. Horizontal gene transfer and recombination shape mesorhizobial populations in the gene center of the host plants Astragalus luteolus and Astragalus ernestii in Sichuan, China (Xiaoping Zhang).
    • 7. Amplified rDNA restriction analysis (ARDRA) for identification and phylogenetic placement of 16S-rDNA clones (Menachim Sklarz).
    • 8. Clustering-based peak alignment algorithm for objective and quantitative analysis of DNA fingerprinting data (Satoshi Ishii).
  • The Species Concept.
    • 9. Population genomics informs our understanding of the bacterial species concept (Margaret Riley).
    • 10. Genome analysis of Streptococcus agalactiae: Implication for the microbial “pan-genome” (Rino Rappuoli).
    • 11. Metagenomic insights into bacterial species (Kostas Konstantinidis).
    • 12. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology (Erko Stackebrandt).
    • 13. Metagenomic Approaches for the Identification of Microbial Species (David Ward).
  • Metagenomics.
    • 14. Microbial Ecology in the age of metagenomics (Jianping Xu).
    • 15. The enduring legacy of small rRNA in microbiology (Susan Tringe).
    • 16. Pitfalls of PCR-based rRNA gene sequence analysis:  an update on some parameters (Erko Stackebrandt).
    • 17. Empirical testing of 16S rRNA gene PCR primer pairs reveals variance in target specificity and efficacy not suggested by in silico analysis (Sergio Morales and Bill Holben).
    • 18. The impact of next-generation sequencing technologies on (meta)genomics (George Weinstock).
    • 19. Accuracy and quality of massively parallel DNA pyrosequencing (Susan Huse and David Mark Welch).
    • 20. Environmental shotgun sequencing: Its potential and challenges for studying the hidden world of microbes (Jonathan Eisen).
    • 21. Comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library (C. Manischan).
    • 22. Metagenomic libraries for functional screening (Svein Valla).
    • 23. GC Fractionation Allows Comparative Total Microbial Community Analysis, Enhances Diversity Assessment, and Facilitates Detection of Minority Populations of Bacteria (Bill Holben).
    • 24. Enriching plant microbiota for a metagenomic library construction (Ying Zeng).
    • 25. Towards Automated Phylogenomic Inference (Wu and Eisen).
    • 26. Integron first gene cassettes: a target to find adaptive genes in metagenomes (Christine Cagnon).
    • 27. High-resolution metagenomics: assessing specific functional types in complex microbial communities (Christoserdova).
    • 28. Gene-targeted metagenomics (GT-metagenomics) to explore the extensive diversity of genes of interest in microbial communities (J. Tiedje).
    • 29. Phylogenetic screening of metagenomic libraries using homing endonuclease restriction and marker insertion (Torsten Thomas).
    • 30. ArrayOme- & tRNAcc-facilitated mobilome discovery: comparative genomics approaches for identifying rich veins of novel bacterial DNA sequences (Hong-Yu OU).
    • 31. Sequence-Based Characterization of Microbiomes by Serial Analysis of Ribosomal Sequence Tags (SARST) (Zhongtang Yu).
  • Consortia and Databases.
    • 32. The metagenomics of plant pathogen-suppressive soils (J.D. Van Elsas).
    • 33. Soil Metagenomic Exploration of the Rare Biosphere (Pascal Simonet and Timothy Vogel).
    • 34. The BIOSPAS consortium: Soil Biology and agricultural production (Luis Wall).
    • 35. The Human Microbiome Project (George Weinstock).
    • 36. The Ribosomal Database Project: sequences and Software for high-throughput rRNA analysis (J. R. Cole, G. M. Garrity and Jim Tiedje).
    • 37. The metagenomics RAST server- a public resource for the automatic phylogenetic and functional analysis of metagenomes (Folker Meyer).
    • 38. The EBI Metagenomics Archive, Integration and Analysis resource (Apweiler).
  • Computer Assisted Analysis.
    • 39. Comparative metagenome analysis using MEGAN (Suparna Mitra and Daniel Huson).
    • 40. Phylogenetic binning of metagenome sequence samples (Alice C. McHardy).
    • 41. Gene prediction in metagenomic fragments with Orphelia: A large scale machine learning approach (Katharina Hoff).
    • 42. Binning metagenomic sequences using seeded GSOM (Sen-Lin Tang).
    • 43. Iterative read mapping and assembly allows the use of a more distant reference in metagenomic assembly (Bas E. Dutilh).
    • 44. Ribosomal RNA identification in metagenomic and metatranscriptomic datasets (Li).
    • 45. SILVA: comprehensive databases for quality checked and aligned ribosomal RNA sequence data compatible with ARB (Frank Gloeckner).
    • 46. ARB; a software environment for sequence data (Wolfgang Ludwig).
    • 47. The Phyloware Project: A software framework for phylogenomic virtue (Daniel Frank).
    • 48. Metasim- A sequencing simulator for genomics and metagenomics (Daniel Richter).
    • 49. ClustScan: an integrated program package for the detection and semi-automatic annotation of secondary metabolite clusters in genomic and metagenomic DNA datasets (Daslav Hranueli).
    • 50. MetaGene; Prediction of prokaryotic and phage genes in metagenomic sequences (Noguchi).
    • 51. primers4clades, a web server to design lineage-specific PCR primers for gene-targeted metagenomics (Pablo Vinuesa).
    • 52. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes (Y. Ye).
    • 53. ESPRIT: estimating species richness using large collections of 16S rRNA data (Yijun Sun).
  • Complementary Approaches.
    • 54. (Meta) genomics approaches in systems biology (Manuel Ferrer).
    • 55. Towards “focused metagenomics”: a case study combining DNA stable-isotope probing, multiple displacement amplification and metagenomics (J. Colin Murrell).
    • 56. Suppressive subtractive hybridization reveals extensive horizontal transfer in the rumen metagenome (Bryan White).
  • Microarrays.
    • 57. GeoChip: A high throughput metagenomics technology for dissecting microbial community functional structure (J. Zhou).
    • 58. Phylogenetic microarrays (PhyloChips) for analysis of complex microbial communities (Eoin Brodie).
    • 59. Phenomics and Phenotype MicroArrays: Applications Complementing Metagenomics (Barry Bochner).
    • 60. Microbial persistence in low biomass, extreme environments: The great unknown (Kasthuri Venkateswaran).
    • 61. Application of phylogenetic oligonucleotide microarrays in microbial analysis (Nian Wang).
  • Metatranscriptomics.
    • 62. Isolation of mRNA from environmental microbial communities for metatranscriptomic analyses (P. Schenk).
    • 63. Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre (Rachel Poretski).
    • 64. The “double RNA” approach to simultaneously assess the structure and function of environmental microbial communities by meta-transcriptomics (Tim Urich and Christa Schleper).
    • 65. Soil eukaryotic diversity, a metatranscriptomic approach (Marmeisse).
  • Metaproteomics.
    • 66. Proteomics for the analysis of environmental stress responses in prokaryotes (Ksenia Groh, Victor Nesatyy and Marc Suter).
    • 67. Microbial community proteomics (Paul Wilmes).
    • 68. Synchronicity between population structure and proteome profiles: A metaproteomic analysis of Chesapeake Bay bacterial communities (Feng Chen).
    • 69. High-Throughput Cyanobacterial Proteomics: Systems-level Proteome Identification and Quantitation (Phillip Wright).
    • 70. Protein Expression Profile of an Environmentally Important Bacterial Strain: the Chromate Response of Arthrobacter sp. strain FB24 (K. Henne).
  • Metabolomics.
    • 71. The small molecule dimension: Mass spectrometry based metabolomics, enzyme assays, and imaging (Trent R. Northen).
    • 72. Metabolomics: high resolution tools offer to follow bacterial growth on a molecular level (Lucio Marianna and Philipp Schmitt-Kopplin).
    • 73. Metabolic profiling of plant tissues by electrospray mass spectrometry (Heather Walker).
    • 74. Metabolite identification, pathways and omic integration using online databases and tools (Matthew Davey).
  • Single cell analysis.
    • 75. Application of cytomics to separate natural microbial communities by their physiological properties (Susann Müller).
    • 76. Capturing microbial populations for environmental genomics (A. Pernthaler/Wendeberg).
    • 77. Microscopic single-cell isolation and multiple displacement amplification of genomes from uncultured prokaryotes (Peter Westermann).

Volume II: Metagenomics in Different Habitats

  • 1. Introduction (Frans J. de Bruijn).
  • Viral Genomes.
    • 2. Viral metagenomics (Shannon Williamson).
    • 3. Methods in Viral Metagenomics (Thurber).
    • 4. Metagenomic contrasts of viruses in soil and aquatic environments (Eric Wommack).
    • 5. Biodiversity and biogeography of phages in modern stromatolites and thrombolites (Christelle Desnues).
    • 6. Assembly of Viral Metagenomes from Yellowstone Hot Springs Reveals Phylogenetic Relationships and Host Co-Evolution (Thomas Schoenfeld).
    • 7. Next-generation sequencing and metagenomic analysis; a universal diagnostic tool in plant pathology (Ian Adams).
    • 8. Direct Metagenomic Detection of Viral Pathogens in Human Specimens Using an Unbiased High-throughput Sequencing Approach (T. Nakaya).
  • The Soil Habitat.
    • 9. Soil based Metagenomics (R. Daniel).
    • 10. Methods in Metagenomic DNA, RNA and Protein Isolation from Soil (P. Gunasharan).
    • 11. Soil Microbial DNA Purification Strategies for Multiple Metagenomic Applications (Mark Liles).
    • 12. Application of PCR-DGGE and metagenome walking to retrieve full-length functional genes from soil (Morimoto).
    • 13. Actinobacterial diversity associated with Antarctic Dry Valley mineral soils (Cowan).
    • 14. Targeting major soil-borne bacterial lineages using large-insert metagenomic approaches (G. Kowalchuk).
    • 15. Novelty and uniqueness patterns of rare members of the soil biosphere (M. Elshahed).
    • 16. Extensive phylogenetic analysis of a soil bacterial community illustrates extreme taxon evenness and the effects of amplicon length, degree of coverage, and DNA fractionation on classification and ecological parameters (Holben WE).
    • 17. The Antibiotic Resistance: Origins, Diversity, and Future Prospects (Gerard Wright).
  • The Digestive Tract.
    • 18. Functional Intestinal Metagenomics (Michael Kleerebezem).
    • 19. Assessment and improvement of methods for microbial DNA preparation from fecal Samples (M. Hattori).
    • 20. Role of dysbiosis in inflammatory bowel diseases (Johan Dicksved).
    • 21. Culture independent analysis of the human gut microbiota and its activities (Kieran Tuohi).
    • 22. Complete genome of an uncultured endosymbiont coupling nitrogen fixation to cellulolysis within protist cells in termite gut (Hongo).
    • 23. Cloning and identification of genes encoding acidic cellulases from metagenomes of buffalo rumen (Feng).
  • Marine and Lakes.
    • 24. Microbial diversity in the deep seas and the underexplored “rare biosphere” (David Mark Welch and Susan Huse).
    • 25. Bacterial Community Structure and Dynamics in a Seasonally Anoxic Fjord (Steven J. Hallam).
    • 26. Adaptation to nutrient availability in marine microorganisms by gene gain and loss (A. Martini).
    • 27. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities (Jack Gilbert).
    • 28. Metagenomic approach studying the taxonomic and functional diversity of the bacterial community in a lacustrine ecosystem (Didier Debroas).
    • 29. Metagenomics of the marine subsurface: the first glimpse from the Peru Margin, ODP Site 122 (Jennifer Biddle).
    • 30. A targeted metagenomic approach to determine the ‘population genome’ of marine Synechococcus (D. J. Scanlan).
    • 31. Diversity and role of bacterial integron/gene cassette metagenome in extreme marine environments (Hosam Easa Elsaied and Akihiko Maruyama).
  • Other Habitats.
    • 32. The Olavius algarvensis metagenome revisited: lessons learned from the analysis of the low diversity microbial consortium of a gutless marine worm (Nicole Dubulier).
    • 33. Microbiome diversity in human saliva (Ivan Nasidze).
    • 34. Approaches to understanding population level functional diversity in a microbial community (D. Bhaya).
    • 35. A functional metagenomic approach for discovering nickel resistance genes from the rhizosphere of an acid mine drainage environment (Jose Gonzales-Pastor).
    • 36. The Microbiome of Leaf-cutter Ant Fungus Gardens (Garret Suen).
    • 37. Diversity of archaea in terrestrial hot springs and role in ammonia oxidation (Chuanlun Zhang).
    • 38. Colonization of nascent, deep-sea hydrothermal vents by a novel Archaeal and Nanoarchaeal assemblage (S. Craig Cary).
    • 39. Analysis of the Metagenome from a biogas-producing microbial Community by means of Bioinformatics Methods (Andreas Schlueter).
    • 40. Amplicon pyrosequencing analysis of endosymbiont population structure (Colleen Kavahagh).
    • 41. Investigating bacterial diversity along alkaline hot spring thermal gradients by barcoded pyrosequencing (Scott Miller and Michael Welzer).
    • 42. Genetic characterization of microbial communities living at the surface of building stones (J. C. Salvado).
    • 43. Novel aromatic degradation pathway genes and their organization as revealed by metagenomic analysis (Kentaro).
    • 44. Functional screening of a wide host-range metagenomic library from a wastewater treatment plant yields a novel alcohol/aldehyde dehydrogenase (Wexler).
    • 45. Aromatic hydrocarbon degradation genes from chronically polluted Subantarctic marine sediments (H. M. Dionisi).
    • 46. Isolation and characterization of alkane hydroxylases from a metagenomic library of Pacific deep-sea sediment (Fengping Wang).
  • Biocatalysts and Natural Products.
    • 47. Emerging Fields in Functional Metagenomics and its Industrial Relevance – Overcoming Limitations and Redirecting the Search for Novel Biocatalysts (Wolfgang Streit).
    • 48. Carboxylesterases and Lipases from Metagenomes (Chow and Wolfgang Streit).
    • 49. Expanding small molecule functional metagenomics through parallel screening of broad host-range cosmid environmental DNA libraries in diverse Proteobacteria (Sean Brady).
    • 50. Biomedicinals from the microbial metagenomes of marine invertebrates (Walter Dunlap).
    • 51. Molecular characterization of TEM-type beta-Lactamases identified in Cold-seep sediments of Edison Seamount (South of Lihir Island).
    • 52. Identification of Novel Bioactive Compounds from the Metagenome of the Marine (David Lejon).
    • 53. Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing (Thomas Schoenfeld).
  • Summary.
    • 54. Future of metagenomics, metatranscriptomics, metabolomics, metaproteomics and single cell analysis: A perspective (J. Tiedje).
    • 55. Darwin in the 21st Century: Natural Selection, Molecular Biology, and Species Concepts (Francisco Ayala).

Crosspost from microBEnet: Where is metagenomic analysis heading? Hopefully in directions suggested in this paper.

Figure 3 from Raes et al. Molecular Systems Biology 7 #473  doi:10.1038/msb.2011.6 

Just a quick post here. I have been reading this paper: Toward molecular trait-based ecology through integration of biogeochemical, geographical and metagenomic data by Jeroen Raes et al. in Molecular Systems Biology. The integration they try to pull off in the paper is, to me, where we need to move as a field (i.e., microbial ecology) in order to make full use of metagenomic data. The paper provides a nice overview of microbial biogeography too. Definitely worth a read.

Am crossposting this from the microBEnet blog (microBEnet is the site for the microbiology of the built environment network that I am building):

iEVOBIO Coming soon (6/20-21): Metagenomics, Biodiversity, Barcoding, Data Integration, and more

Last year, iEVOBIO was a fun, interesting meeting for many reasons (not the least of which is that I was the keynote speaker). If you want to learn more about last year's meeting, check out my blog post: Summary of #iEVOBIO Day 1 #evolution #phylogenetics #informatics #opensource

But I note, that meeting was so, well, last year.  This year, the meeting is focusing on metagenomics, barcoding and biodiversity as well as data integration (see the meeting website for more information: iEvoBio: Home).  From the website:

In 2011 iEvoBio will have a special focus session on metagenomics, barcoding, and biodiversity and the challenges that these new approaches raise for evolutionary informaticians. We now have over 6000 genomes and vast quantities of metagenomic sequences in the public domain, primarily from bacteria and archaea from many habitats. Various short sequences (e.g. barcodes) for quick identification of eukaryotes are emerging. The availability of this sequence data and ever-cheaper methods for producing it offer exciting opportunities for understanding molecular evolution and biodiversity. However, the data are growing faster than the infrastructure to support it. Thus there are informatics challenges for visualizing, analyzing, interpreting, and managing the data and the results from it. Moreover, the eukaryotic and microbial informatics communities have independent histories and approaches so synergy is not easy.

These challenges typify the intersection of fields that are the scope of iEvoBio. Speakers in this special session of iEvoBio will present their work and participate in a panel discussion. We will have 3 or 4 invited speakers at this session, including Neil Davies (Moorea Biocode project), Linda Amarral-Zettler (Marine Biological Laboratory Woods Hole), Holly Bik (Hubbard Center for Genome Studies, University of New Hampshire), and David Schindel (Barcode of Life, and Smithsonian Institution). After the talks, there will be an open panel discussion with all the speakers, including keynote speaker Dawn Field (Center for Ecology and Hydrology at Oxford). We encourage you to attend this special session and participate in what we think will be a remarkably fruitful meeting.

Sounds pretty snazzy to me.  Plus, on top of all that, the meeting is very committed to Open Science:

iEvoBio and its sponsors are dedicated to promoting the practice and philosophy of Open Source software development and reuse within the research community. For this reason, if a submitted talk concerns a specific software system for use by the research community, that software must be licensed with a recognized Open Source License, and be available for download, including source code, by a tar/zip file accessed through ftp/http or through a widely used version control system like cvs, Subversion, git, Bazaar, or Mercurial.

From an email I received:

More details about the program and guidelines for contributing content are available at http://ievobio.org.  You can also find continuous updates on the conference’s Twitter feed at http://twitter.com/iEvoBio.

iEvoBio is sponsored by the US National Evolutionary Synthesis Center (NESCent) in partnership with the Society for the Study of Evolution (SSE) and the Society of Systematic Biologists (SSB). Additional support has been provided by the Encyclopedia of Life (EOL).

The iEvoBio 2011 Organizing Committee:
Rob Guralnick (University of Colorado at Boulder) (Co-chair)
Cynthia Parr (Encyclopedia of Life) (Co-chair)
Dawn Field (UK National Environmental Research Center)
Mark Holder (University of Kansas)
Hilmar Lapp (NESCent)
Rod Page (University of Glasgow)

Overall, it seems like this would be a good place to learn about the uses of high throughput sequencing in ecological and evolutionary studies.

(note – cross posting at http://www.microbe.net/2011/04/29/want-to-learn-what-eco-evo-types-are-doing-w-metagenomic-barcoding-data-go-to-ievobio-621-22/)

Wanted: Sample collections for the Earth Microbiome Project (EMP); help make an open Field Guide to the Microbes

The Earth Microbiome Project (EMP) is “a systematic attempt to characterize the global microbial taxonomic and functional diversity”.  A little more detail is provided on the project website:

The Earth Microbiome Project is a proposed massively multidisciplinary effort to analyze microbial communities across the globe. The general premise is to examine microbial communities from their own perspective. Hence we propose to characterize the Earth by environmental parameter space into different biomes and then explore these using samples currently available from researchers across the globe. We will analyze 200,000 samples from these communities using metagenomics, metatranscriptomics and amplicon sequencing to produce a global Gene Atlas describing protein space, environmental metabolic models for each biome, approximately 500,000 reconstructed microbial genomes, a global metabolic model, and a data-analysis portal for visualization of all information.

This project is certainly incredibly ambitious.  But hey, why not aim high?  The project is being coordinated by Jack Gilbert, Folker Meyer, and Rick Stevens from Argonne National Lab and the University of Chicago as well as Janet Jansson from Lawrence Berkeley National Lab and Rob Knight from U. Colorado Boulder. 

I note I am not unbiased here as I am one of the members of the EMP Steering Committee (others on the Committee include Jed Fuhrman, Janet Jansson, and Rob Knight).  

If you want to learn a bit more about the origins of the project, read this paper, which is a report from a meeting where the idea came together, as well as a follow-up paper.

Anyway, the reason I am writing this post is that the EMP is looking for collaborators and participants.  In particular we are hoping to line up lots of large sample collections that could be included in future analyses.  Currently a pilot project is being done on ~10,000 samples.  These will be characterized in a variety of ways including collection of metadata and also generation of sequence information for DNA from the samples (including both rDNA and metagenomic sequencing).  But the EMP wants more – much more.  So the EMP is recruiting anyone who either currently has, or could possibly collect, large collections of samples for microbial characterization along with rich contextual data about the samples.  The contextual data would ideally include as much information about the physical, chemical and biological parameters found at the time of sampling as possible. These parameters include, but are not limited to, nutrient concentrations, temperature, salinity, porosity, moisture content, time of day, latitude and longitude, depth below surface, elevation, pH, etc.

What types of samples are wanted?  Well, right now, just about anything could be useful.  Examples of things that could be useful include soil samples from a transect along the equator, filtered water from all lakes in Minnesota, deep sea sediment cores, filtered air from giant dust storms, microbial mats from hypersaline ponds, and so on.  The goal of the project is to develop and use massively high throughput methods to extract DNA and then generate sequence data from millions of samples.  There is of course a fund raising component and work is underway to secure funds to characterize the first collections.  Some corporations and institutes have already promised some support (e.g., Eppendorf and MoBio – see here). 

I note that the plans for the project are to be completely open in terms of data release.  Data that is generated will be released with no restrictions on use to everyone and anyone interested in utilizing it.

So if you have or might be able to collect some interesting samples and are interested in participating in this open science initiative, please contact submissions@earthmicrobiome.org to start this process rolling. We have already collated 55,000 samples from 30 PIs across the globe and this number is growing rapidly.

Be part of a revolution – open access data analysis to help define the microbial world which supports life on this planet. Plus, it's better than working alone.

Note there is an upcoming meeting focusing on the EMP. The meeting is in Shenzhen June 13-15th. This is a good place to have the meeting as the Beijing Genomics Institute is a key partner in the EMP.

A final note: this project is a key step in my dream – to have a field guide to the microbes. It will not be all that is needed but it will be a good component.

For some additional discussions of the EMP see:

And most importantly – sign up to provide samples …submissions@earthmicrobiome.org

Interesting PLoS One paper on local assembly from short reads by "tagging" DNA via restriction enzymes

Quick one here. Interesting paper from Paul Etter et al. from Eric Johnson’s lab at U. Oregon in PLoS ONE: Local De Novo Assembly of RAD Paired-End Contigs Using Short Sequencing Reads



Here is the abstract:

“Despite the power of massively parallel sequencing platforms, a drawback is the short length of the sequence reads produced. We demonstrate that short reads can be locally assembled into longer contigs using paired-end sequencing of restriction-site associated DNA (RAD-PE) fragments. We use this RAD-PE contig approach to identify single nucleotide polymorphisms (SNPs) and determine haplotype structure in threespine stickleback and to sequence E. coli and stickleback genomic DNA with overlapping contigs of several hundred nucleotides. We also demonstrate that adding a circularization step allows the local assembly of contigs up to 5 kilobases (kb) in length. The ease of assembly and accuracy of the individual contigs produced from each RAD site sequence suggests RAD-PE sequencing is a useful way to convert genome-wide short reads into individually-assembled sequences hundreds or thousands of nucleotides long.”


As the authors note in the paper: “Competing interests: E.A.J. has patents filed on the RAD marker, and partial interest in a company commercializing the system. This does not alter the authors’ adherence to all the PLoS ONE policies on sharing data and material.” This seems like it would have potential in metagenomic applications. I note that we are working on a similar approach and kind of got scooped here, in a way. I hope their patent does not limit what we can do.
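
To make the idea in the abstract a bit more concrete, here is a minimal sketch of "local assembly by tag". It is not the authors' pipeline. It assumes each read pair is a (read1, read2) tuple in which read1 starts at the restriction site, so that pairs from the same RAD site share an identical read1. The example sequences, the function names and the simple greedy overlap merge are all made up for illustration, and real data would also need handling of sequencing errors, SNPs and reverse complements.

```python
from collections import defaultdict


def group_by_rad_tag(pairs):
    """Bucket second reads by the sequence of their first (restriction-site) read."""
    groups = defaultdict(list)
    for read1, read2 in pairs:
        groups[read1].append(read2)
    return dict(groups)


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (at least min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the two sequences with the largest suffix/prefix overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left that overlaps; leave the rest as separate contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    return seqs


# Hypothetical toy data: two RAD sites, each with a few overlapping second reads.
pairs = [
    ("GAATTCAC", "TTGACCGTAA"),
    ("GAATTCAC", "CTTGGCATCG"),
    ("GAATTCAC", "CGTAAGCTTG"),
    ("GAATTCTG", "AAAACCCC"),
    ("GAATTCTG", "CCCCGGGG"),
]

# Assemble each RAD site on its own, using only the reads tagged with that site.
contigs = {tag: greedy_assemble(reads) for tag, reads in group_by_rad_tag(pairs).items()}
print(contigs)
```

The point of the sketch is only that the tag turns one hard genome-wide assembly problem into many small, independent ones, which is what makes the approach look attractive for metagenomic data.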

The story behind the story of my new #PLoSOne paper on "Stalking the fourth domain of life" #metagenomics #fb

Well, here goes.

This is a post about a paper that has been a long long time coming. Today, a paper of mine is being published in PLoS One. The paper is titled “Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees” and is available at http://dx.plos.org/10.1371/journal.pone.0018011. (or if that link does not work you can get a copy here). This paper represents something I started a long time ago and I am going to try to describe the story behind the paper here.

I note – we are not doing a press release for the paper, for a few reasons. But one of them is that, well, I am starting to hate press releases. So I guess this is kind of my press release. But this will be a bit longer than most press releases. I note – my key fear here is that somehow in my communications with the press or in our text in the paper or in this post I will overstate our findings. Here is the punchline – we found some very phylogenetically novel forms of phylogenetic marker genes in metagenomic data. We do not have a conclusive explanation for the origin of these sequences. They may be from novel viruses. They may be ancient paralogs of the marker genes. Or they may be from a new branch of cellular organisms in the tree of life, distinct from bacteria, archaea or eukaryotes. I think most likely they are from novel viruses. But we just don’t know.

UPDATE: Am posting some links here to news stories/blogs about our paper





    First – a summary of what we did.

    In the paper, we searched through metagenomic data (sequences from environmental samples) for phylogenetically novel sequences for three standard phylogenetic marker genes (ss-rRNA, recA, rpoB). We focused on sequences from the Venter Global Ocean Sampling data set because, well, we started this analysis many years ago when that was the best data set available (more on this below). What we were looking for were evolutionary lineages of these genes that were separate from the branches that corresponded to the three known “Domains” of life (bacteria, archaea and eukaryotes).

    To search for such novel lineages in the metagenomic data, we built evolutionary trees using these genes where we included sequences from known organisms (and viruses) as well as sequences from metagenomic data. We then looked through the trees for groups that were both phylogenetically novel and included only environmental data (i.e., they were new compared to known organisms or viruses). This method did not work very well for rRNA sequences (largely because making high quality alignments of short phylogenetically novel rRNA sequences was difficult – more on this below). But with RecA and RpoB homologs we were able to generate what we believe to be robust phylogenetic trees. And in these trees we found evidence for phylogenetically very novel sequences in environmental data.

    Figure 1. Phylogenetic tree of the RecA superfamily. 

    Figure 3. Phylogenetic tree of the RpoB superfamily
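
The tree-screening step described above (look for groups that contain only environmental sequences) can be sketched in a few lines. This is an illustration only, not the code used for the paper. It assumes a rooted tree written as nested tuples with leaf names as strings, and it assumes environmental sequences carry a made-up "ENV_" prefix. It also ignores branch lengths, so it finds environmental-only clades but does not measure how deep or novel they are.

```python
def leaves(node):
    """Return the leaf labels under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out


def is_env(label):
    # Assumption: environmental reads are labelled with an "ENV_" prefix.
    return label.startswith("ENV_")


def env_only_clades(node, min_size=2):
    """Return the maximal clades whose leaves are all environmental sequences."""
    tips = leaves(node)
    if all(is_env(t) for t in tips):
        # Whole subtree is environmental: report it once and do not descend further.
        return [sorted(tips)] if len(tips) >= min_size else []
    if isinstance(node, str):
        return []  # a reference sequence on its own
    found = []
    for child in node:
        found.extend(env_only_clades(child, min_size))
    return found


# Hypothetical toy tree: one environmental read nested inside the bacteria,
# and three environmental reads forming their own group.
tree = (
    (("Bacteria_Ecoli", "Bacteria_Bsubtilis"), ("ENV_1", "Bacteria_Synechococcus")),
    (("Archaea_Msmithii", "Eukaryota_Scerevisiae"), (("ENV_2", "ENV_3"), "ENV_4")),
)

clades = env_only_clades(tree)
print(clades)
```

In this toy tree ENV_1 sits next to a known bacterium and is not reported, while ENV_2, ENV_3 and ENV_4 come back as a single candidate group that would then need a closer look by eye.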

    We then propose and discuss four potential mechanisms that could lead to the existence of such evolutionarily novel sequences. The two we consider most likely are the following:

    1. The sequences could be from novel viruses
    2. The sequences could be from a fourth major branch on the tree of life

    Unfortunately, we do not actually know the source of these sequences, so we cannot determine which of the explanations is correct. Obviously if there is a novel lineage of cellular organisms out there, well, that would be cool. But we have no evidence right now that this is what is going on. Personally, I think it is most likely that these novel sequences are from weird viruses. But as far as we can tell, they truly could be from a fourth major branch of cellular organisms, and thus even though we did not have the story completely pinned down, we decided to finally write up the paper to get other people to think about this issue.

    Below I give all sorts of other details about the project in the following areas

    • The history of the project 
    • More detail on what is in the paper 
    • Follow up analysis and rapid posting via Google Knol 
    • Data deposition in Dryad 
    • Who was involved 
    • UPDATE: Funding for this work



    The history of the project

    Well, this is one of those projects for which the history is hard to explain. We started this work in 2004 when I was helping Venter and colleagues analyze the Sargasso Sea metagenome data. I was working at TIGR in 2003, which at the time was a sister institute to some of the institutes affiliated with the J. Craig Venter Institute (JCVI) (it was a complicated time). Craig had led a project to do a massive amount of shotgun sequencing of DNA isolated from the Sargasso Sea, which had been the site of many previous studies of uncultured microbes. And Craig, as well as some of the people working with him including John Heidelberg who was at TIGR, had asked me to help in analysis of the data. So I eventually went to a meeting about the project and got involved. It was quite exciting and I put a lot of effort into helping analyze the data.

    As part of my work on the project, I and Martin Wu and Dongying Wu did a variety of phylogenetic studies of genes and gene families. One of these was a phylogenetic analysis of proteorhodopsin homologs showing massively more diversity in the Sargasso data than in the PCR experiments done by Delong and Beja and others.

    Figure 7 from Venter et al. 2004. 

    We also did the first “phylotyping” in metagenomic data using genes other than rRNA. We built trees of bacterial ss-rRNAs, RecAs, RpoBs, HSP70s, EF-Tus and EF-Gs and then assigned each sequence to a phylum from the trees. In this analysis we found a variety of interesting things. 

    Figure 6 from Venter et al. 2004. 
    One thing I did not include in the Sargasso paper was an analysis I did of RecA homologs where I tried to include ALL RecA-like genes from bacteria, archaea, eukaryotes and viruses. The trees I made were a bit unusual but I was not sure that the alignments I had made were robust or that I had found all the RecA-like genes of interest so I did not even show this to Craig et al. at the time.
    UPDATE: I note – our work on this project was supported by a grant from the NSF Assembling the Tree of Life program that was awarded to me and Naomi Ward and Karen Nelson. Those funds supported the development of many of the informatics tools we used in this analysis and Martin and Dongying were both working on that project.

    After the Sargasso paper was published in 2004 though, I continued to fester about the RecA trees. And I wondered: instead of trying to classify bacterial sequences into phyla, what if I tried to look for RecAs, rRNAs and other genes that formed completely new branches in the tree of life? I got the chance to start to play with this concept again when Venter and crew asked me to help analyze the data coming out of the Global Ocean Sampling project. Again, this project was very exciting and interesting.


    As part of the project, I helped Shibu Yooseph and others look into whether the GOS data revealed any completely new types of functionally interesting genes, much like I had shown for proteorhodopsin in the Sargasso data.  


    Figure 7 from Yooseph et al. 2007 . Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined 






    And again my mind started wandering towards the question of “OK – so – if there are all these very unusual and novel functionally interesting genes, what about looking for unusual and very novel phylogenetic marker genes”? So finally, I got back to work on the issue.

    And so I built a better RecA tree by first pulling out all possible homologs of RecA and RecA like proteins from the GOS data and then building an alignment and a tree. And there they were. Some very f*%&$ novel RecAs – distinct from any previously known RecA like proteins as far as I could tell. And so with help from Dongying and the JCVI crew, we started building a story about novel RecAs. And then we looked at RpoBs. And found novel ones too. And in mid 2006 while Shibu and Doug worked on their papers that were to be submitted to PLoS Biology and I worked on a review paper too, I told Emma Hill (who has since changed her name to Emma Ganley due to some sort of wedding thing) at PLoS Biology about an analysis that was consistent with the existence of a fourth domain of life. No overstating our findings really – just that we found very novel phylogenetic marker genes. And that I was working on a paper on it. But alas I never got it done, though I was happy to have convinced Venter to send the GOS papers to PLoS Biology and I think the papers that came out were good. Among the papers were my review (Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes), Doug Rusch’s diversity paper (The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific) and Shibu’s protein family paper (The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families), as well as many others as part of the Ocean Metagenomics Collection at PLoS.

    And in the midst of all of this, we had our first child and we wanted to move back to Northern California to be closer to family (my wife’s family is all in the Bay Area and my sister and brother Michael were in N. Cal too). So I applied for jobs and eventually took a job at UC Davis and we moved to Davis. Needless to say, all of that put a bit of a crimp in my work productivity. And once I was up and running at Davis, it just took a long time to get back to searching for novel deep branches in the tree of life. But finally, we did it (with periodic prodding from Craig Venter). And we put together a paper and got it submitted to PLoS One in October. The reviews were very positive and enormously helpful. And we finally got a revision in January and it was officially accepted in February 2011. Only some seven years after my first work on the project. Whew.

    More detail of what is in the paper
    Well, I am going to be posting here some additional detail on what is in the paper.



    Why we punted on analysis of very novel rRNAs.

    The problem with rRNA is that the sequences that come from environmental samples are not complete (i.e. they only correspond to portions of the rRNA genes). Unfortunately, this makes a key step in phylogenetic analysis difficult – the alignment of sequences. We actually found about 200 rRNA sequences that seemed unusual in a phylogenetic sense. However, we were not convinced that the alignments of these fragments to other rRNAs were robust. This is because the alignment of rRNAs is best done making use of the base pairing secondary structure of the molecule and not the base sequence (i.e., primary structure).

    With only rRNA fragments, we could not use the secondary structure to do the alignments because you need the whole molecule to determine the best folding. Combined with the fact that we were searching for very distantly related ribosomal RNAs which would be hard to align even if we had the whole molecule, we were stuck for a bit. It seemed impossible to look for really novel organisms.
    So that is when we turned to other genes. The key for this is that there are protein coding genes that are universal and that for known organisms show similar patterns to rRNA in trees. In fact, in 1995 I wrote a paper showing that trees of RecA were very similar to trees of rRNA. RpoB is also considered a very robust phylogenetic marker. For organisms that we have in the lab (i.e., cultured) – many people use these other genes for phylogenetic analysis. rRNA has been very important in part because of the ease with which one can PCR amplify it from environmental samples and the fact that it is very hard to PCR amplify protein coding genes from the environment. Metagenomics changes this. With random sequencing, you get data from all genes. This means we can pick and choose genes to analyze for phylogenetic analysis and do not have to rely on rRNA.

    So we went after RecA first, because it has been shown to be a good phylogenetic marker for studies of the tree of life. And we found some very novel branches in the RecA tree. And after analyzing these and convincing ourselves that they were indeed phylogenetically very novel we went after RpoB. And also found very novel branches.

    So the phylogenetic analysis I think is very robust.

    RecA and RpoB as phylogenetic markers

    Many genes have been used as alternatives to rRNA genes to build “Trees of Life” including all organisms. Each has its own flavors of advantages and drawbacks. Two commonly used ones are the RecA and RpoB superfamilies.

    The many possible explanations for finding novel forms of phylogenetic marker genes

    The phylogenetically novel phylogenetic marker genes we found could have many explanations including that they could be ancient paralogs of these genes (but not found in any genomes we have available), they could be from viruses, or they could be from a novel branch on the tree of life. Or our trees could be bad. We think the latter is somewhat unlikely as our analysis has many lines of support. For example our RecA trees are very similar to those from a comprehensive study from M. Nei’s lab except they did not include the metagenomic data. But I guess it is still a possibility that our trees are biased in some way (e.g., by long branch attraction or bad alignments).

    Follow up analysis and rapid posting via Google Knol

    Amazingly and a bit sadly, I think we rushed the paper out. We left out one thing partly by accident – we had done an analysis of the locations from which these novel RecA and RpoB sequences had come. And somehow, in our final push to get the paper out, we left this out. I will be posting this information as soon as possible here and on the PLoS One site.

    In addition, after submitting the revision of our paper, we realized that we might be able to do a deeper analysis on one aspect of the work – how RpoB homologs from unusual DNA viruses compared to our novel sequences. We had included some RpoBs from DNA viruses in our analyses but not all that were available. So Dongying Wu did a very rapid additional analysis, adding some additional RpoB homologs to our alignment and making a tree of them. We then wrote a Google Knol about this new tree and submitted the Knol to PLoS Currents “Tree of Life” where it is currently in review. We are publishing the preprint of this Knol to make it available to all even while it is in review.


    Figure 2 from Wu and Eisen submitted. 

    Data availability

    There is a move afoot to make sure all data/tools associated with publications are readily available. We used publicly available sequence data and as much as possible publicly available tools for our work. We are trying to release as much as possible to allow people to re-analyze our work and to do any of the work themselves. We have therefore made use of the Dryad Data deposition service to post some of this material (see http://datadryad.org/handle/10255/dryad.8385).

    Who was involved

    • Dongying Wu, a brilliant “Project Scientist” in my lab, led the project (Project Scientist is one of the UC positions that is like what others call “Senior Scientist”). Dongying is simply one of the best bioinformaticians/computational biologists I have ever met. He was first author on many key papers from my lab including the Genomic Encyclopedia paper that came out last year and the glassy winged sharpshooter symbionts paper that came out a few years ago. Dongying worked in my group at TIGR and moved with me to UC Davis and currently splits his time between UC Davis and the DOE Joint Genome Institute. 
    • Martin Wu. Martin is an Assistant Professor at the University of Virginia. Prior to that he was a Project Scientist in my lab at Davis and a post-doc in my lab at TIGR. He is also a phenomenal bioinformatician / computational biologist. He developed the AMPHORA software in my lab and also led many genome projects (back when sequencing a genome was hard …) including that of the first Wolbachia genome and that of a very unusual bug Carboxydothermus hydrogenoformans. Martin helped with some of the genome analyses as part of this work. 
    • Aaron Halpern, Doug Rusch and Shibu Yooseph are all bioinformaticians from the J. Craig Venter Institute (Aaron is no longer there). All three helped with different aspects of dealing with and analyzing the GOS data and all three have been remarkably patient as this work dragged on and on. 
    • Marv Frazier from the JCVI was helpful in the initial set up and conceptualization of the project. 
    • J. Craig Venter is, well, Craig Venter, and he was involved in multiple aspects of the project including thinking about how and where to look for unusual sequences and interpreting some of the results.

    UPDATE: Funding for this work

    Most of my lab’s early work on this project was supported by a grant we had from the Assembling the Tree of Life program at the National Science Foundation (grant 0228651 to me and Naomi Ward). In that project we were working on sequencing and analyzing genomes from phyla of bacteria for which genomes were not available at the time. As part of this work we were designing methods to build phylogenetic trees from metagenomic data because we thought that our new genomes would be very useful in helping analyze metagenomic reads and figure out from which phyla they came. Later work on the project was supported by a grant to me, Jessica Green and Katie Pollard from the Gordon and Betty Moore Foundation (grant 1660).

    Some questions that might be asked and some answers (based in part on questions I have gotten from reporters). Note if you have other questions please post them here or on the PLOS One site for the paper.

    • Why no press release? Well, in part, because I sent information too late (shocking I know) to the Davis Press Office. But also because they have gotten suddenly busy with some Japan earthquake related actions. But also because, well, I really hate a lot of press releases. And finally, my brother had dinner with Carl Zimmer recently and apparently they discussed the possibility of having no press releases associated with papers. So here goes …. 
    • Really – what took so long? I would like to say the US Government made us hold back on publishing this until they could look into whether Venter collected ocean data from Roswell, NM or not. But really, the story above is true. We just did not get it done earlier. 
    • Why do you not know the source of the DNA (i.e., cells, viruses, etc)? This is why there was a six year wait between discovery and writing this up. We kept thinking we would be able to find the organisms but since I moved from TIGR and started a new job, we just never got around to getting to the source. We therefore decided to open this up to others who will hunt for the source by writing up the paper. 
    • Why did you not rename the Unknown 2 group in the RecA tree? We should have renamed our group “Thaumarchaeota” or something like that. When we did the initial analysis our group was novel. And then a few years ago a few groups obtained data from what is thought to be the third major lineage of Archaea – referred to by some as Thaumarchaeota. This is to go with the Euryarchaeota and Crenarchaeota. See http://www.ncbi.nlm.nih.gov/pubmed/20598889 for example. 
    • One of the clades in the RecA tree (XRCC2) seems out of place phylogenetically. I can see how that is confusing. The XRCC2 clade is very weird and hard to figure out. It is not the “normal” eukaryotic genes – those are the Rad51/DMC1 genes. One complication with the RecA family is that there have been duplication events to go with the species evolution. And thus eukaryotes have Rad51, DMC1, Rad51B, Rad51C, Rad57, XRCC3 and XRCC2. We tried to figure out where the XRCC2 group should go but it just was hard to place. The statistical support for its position (we used a method called bootstrapping) is low (note the lack of a number on the node where the branch leading to XRCC2 connects to the base of the tree). Most likely that group should be placed with some of the other eukaryotic groups. However, it seems likely that there was a duplication in the lineage leading up to the ancestor of eukaryotes and archaea (some studies have indicated they share a common ancestor to the exclusion of bacteria). Such a duplication would explain why basically all archaea have a RadA and a RadB and all / most eukaryotes have multiple paralogs as well. 
    • The Unknown 1 group in the RecA tree seems to group with phage. What can you say about that? We think unknown 1 is potentially of viral origin but still cannot tell. The fact that it clusters with RecA superfamily members from phage suggests this but it is distant enough from known phage for us to not be confident in any predicted origin. As for derivative forms vs. independent branch – this is one of the big questions about viruses these days. Many viruses encode homologs of “housekeeping” genes found across bacteria, archaea and eukaryotes. And in many cases the viral versions of these genes appear to be phylogenetically very novel. This is why the people studying mimivirus (which we refer to) suggest some viruses may in fact represent a fourth branch on the tree of life. It is possible that some viruses are in fact reduced forms of what were once cellular organisms – akin to parasitic intracellular species of bacteria possibly. 
    • Why are these phylogenetically novel sequences so low in abundance? This is a key question. I think it would be easy to come up with a theory for these being rare or these being common. They might be rare if their niche is very limited today. Or they might be rare because they could not be very competitive with other organisms. Or they could be rare because they require some unusual interactions with other taxa. In addition, we have only looked carefully at ocean water samples. If these are common somewhere else (e.g., hotsprings, deep subsurface, etc) we would not yet have figured that out. We are looking at additional metagenomic data right now to see if we can find any locations where relatives of these genes are more common.

    Some related papers by others worth looking at

    Some related papers by me possibly worth looking at

    Some related blog posts I have written over the years


      Dongying Wu, Martin Wu, Aaron Halpern, Douglas B. Rusch, Shibu Yooseph, Marvin Frazier, J. Craig Venter, & Jonathan A. Eisen (2011). Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees PLoS One, 6 (3) DOI: 10.1371/journal.pone.0018011

      Though I generally love NCBI, the Sequence/Short Read Archive (SRA) seems to have issues; what do others think?

      Well, here goes. Hope to not get people from NCBI too pissed off here. Overall, I think NCBI is invaluable: GenBank. PubMed. PubMed Central (PMC) (well, I have some complaints about that but let’s not get into those here — I still like it), BLAST (Basic Local Alignment Search Tool) and a plethora of other tools, databases and resources. Generally, money well spent.

      However, one database from NCBI is driving me a bit wacky these days. This is the Sequence Read Archive (SRA). Known to some as the “Short Read Archive” this database is supposedly for storing “sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.”

      It certainly seems to be used for that function. But alas, storing sequence is not the only need here. Recovering sequence and making use of it is really the key. And this is the area I have been having trouble with (especially related to environmental studies like rRNA PCR and metagenomics). Rather than go on about my particular issues here (and thus possibly biasing the discussion too much), I am wondering what others think of the SRA? Usability? Ease of deposition? Ease of extraction? Missing features? Things it does or does not do well? Do we need a new system for environmental projects?

      Any and all comments welcome here or on twitter or on Friendfeed or wherever. See Friendfeed stream below:

      http://friendfeed.com/treeoflife/4f09e201/though-i-generally-love-ncbi-sequence-short?embed=1

      Here are some comments so far from twitter

      • digitalbio Sandra Porter I agree. RT @phylogenomics: Though I generally love NCBI, the Sequence/Short Read Archive (SRA) seems to hav… (cont) http://deck.ly/~XM75A
      • lswenson Luke Swenson @phylogenomics I was JUST trying to navigate the SRA! There’s no help section to be found, and forget about depositing sequences!
      • audyyy Davis-Richardson @phylogenomics I can never tell if my submission went through without emailing support. Also, no FASTQ support?
      • cabbageRed Rich C .@phylogenomics I agree, the SRA doesn’t seem to be the easiest repository to search with what I believe to be “typical” NGS queries

      Metagenomics/bioinformatics/microbiology job of the week: Chisholm lab at MIT

      Good at bioinformatics? Looking for a job? Check out this one in Penny Chisholm’s lab at MIT:

      Chisholm has done some incredible work on marine microbes. Here are some of her “Open Access” papers to browse:

      Sure she does publish in some other places but does publish a lot in PLoS and other Open Access journals. If you get the job, maybe you can convince her to move more towards publishing everything in Open Access journals — her stuff is very cool — the more of it that is Open Access the better …

      Phylogeny rules:


      I am a coauthor on a new paper in PLoS Computational Biology I thought I would promote here.  The full citation for the paper is:

      The paper discusses a new software program “phylOTU” which is for phylogenetic-based identification of “operational taxonomic units”, which are also known as OTUs.   What are OTUs?  They are basically clusters of closely related sequences that are used to represent something akin to a species.  OTUs are used a lot in environmental microbiology b/c one key way to study microbes in the environment is through extraction and sequencing of DNA.  Traditionally this has been done through PCR amplification and sequencing of one particular gene (ss-rRNA).  Now it is also being done through random sequencing of all DNA from environmental samples (so called metagenomics).
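      To make the OTU idea concrete, here is a small sketch in Python of the simplest kind of clustering: greedy grouping of already-aligned sequences by pairwise identity. This is not how phylOTU works (phylOTU is phylogeny-based; see the paper for the actual procedure). The function names and the 97% default cutoff are assumptions for illustration, and the cutoff is a common but arbitrary convention.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def greedy_otus(seqs: List[str], threshold: float = 0.97) -> List[List[int]]:
    """Assign each sequence to the first OTU whose seed it matches.

    Each OTU is a list of indices into `seqs`; the first index is the seed.
    The result depends on input order, a known weakness of greedy clustering.
    """
    otus: List[List[int]] = []
    for i, seq in enumerate(seqs):
        for otu in otus:
            if identity(seq, seqs[otu[0]]) >= threshold:
                otu.append(i)
                break
        else:
            # No existing seed was similar enough: start a new OTU.
            otus.append([i])
    return otus
```

      For example, greedy_otus(aligned_reads, threshold=0.97) on a list of aligned ss-rRNA fragments returns groups of read indices, each group standing in for something akin to a species.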

      Anyway – the paper is (of course) fully open access and you can read it for more detail.  Just thought I would post a little here about it.  The paper / project was led by Tom Sharpton, a post doc in Katie Pollard’s lab at UCSF working on a collaborative project between Katie’s lab, my lab, and Jessica Green’s lab at U. Oregon (and recently Martin Wu’s new lab at U. Virginia – he was in my lab previously).  This collaborative project even has a name “iSEEM” which stands for integrating statistical, evolutionary and ecological approaches to metagenomics.  This project has been generously supported by the Gordon and Betty Moore Foundation (via a grant for which I am PI).


      Some little tidbits of possible interest about the project

      Sharpton, T., Riesenfeld, S., Kembel, S., Ladau, J., O’Dwyer, J., Green, J., Eisen, J., & Pollard, K. (2011). PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data PLoS Computational Biology, 7 (1) DOI: 10.1371/journal.pcbi.1001061

      Quick post – congrats to Jill Banfield, environmental #microbiology guru, for winning Franklin Medal & L’Oreal-UNESCO award

      Very cool news from UC Berkeley.  Jill Banfield, one of the greats of environmental microbiology, is going to receive both the Benjamin Franklin Medal in Earth and Environmental Science and the L’Oreal-UNESCO “for women in science” award.

      I am very happy to see this.  Jill has done some amazing work in multiple areas of environmental microbiology and continues to push frontiers in technology and science.  And since I am always on an Open Access crusade here, here are some links to some of her recent papers that are free in Pubmed Central: