Draft blog post cleanup #1: Divide and Conquer to Find Orthologs

OK – I am cleaning out my draft blog post list.  I start many posts and don’t finish them, and then they sit in the draft section of Blogger.  Well, I am going to try to clean some of that up by writing some mini posts.  Here is the first:

Saw an interesting paper worth checking out:
PLoS ONE: Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

It not only describes a way to speed up continual ortholog annotation in bacterial and archaeal genomes but is also linked to an ongoing open code development project.

Here is the abstract:

Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. 
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/.

Definitely worth checking out.
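
To get a rough feel for why the pan-genome proxies described in the abstract cut down the work, here is a toy count of sequence comparisons. It is only a sketch under loud assumptions: just two taxonomic levels, every genome the same size, and a pan-genome whose size is a fixed percentage of the sequences in its species. The function names and the pan_percent parameter are made up for illustration and are not taken from xBASE-Orth.

```python
from itertools import combinations


def all_vs_all(genome_sizes):
    """Count pairwise sequence comparisons between every pair of genomes."""
    return sum(a * b for a, b in combinations(genome_sizes, 2))


def two_level(species_groups, pan_percent):
    """Count comparisons when each species is summarized by a pan-genome.

    species_groups: list of lists of genome sizes, one inner list per species.
    pan_percent: ASSUMED pan-genome size as a percentage of the sequences
                 in the species (a hypothetical knob, not a published value).
    """
    # Step 1: all-vs-all only inside each species.
    within = sum(all_vs_all(sizes) for sizes in species_groups)
    # Step 2: one pan-genome per species stands in for all of its genomes.
    pan_sizes = [sum(sizes) * pan_percent // 100 for sizes in species_groups]
    # Step 3: all-vs-all among the much smaller set of pan-genomes.
    between = all_vs_all(pan_sizes)
    return within + between


# Toy input: 5 species, 10 genomes each, 4,000 coding sequences per genome.
groups = [[4000] * 10 for _ in range(5)]
flat = [size for sizes in groups for size in sizes]

naive = all_vs_all(flat)
proxy = two_level(groups, pan_percent=20)
print(naive, proxy, round(naive / proxy, 1))
```

In this toy setup the two-level count is a bit under a quarter of the all-against-all count. The real saving depends entirely on how compact the pan-genomes actually are and on how many taxonomic levels are climbed.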

Bacteria & archaea don’t get no respect from interesting but flawed #PLoSBio paper on # of species on the planet


Uggh. Double uggh. No no. My first blog quadruple uggh. There is an interesting new paper in PLoS Biology published today. Entitled “How Many Species Are There on Earth and in the Ocean?” (PLoS Biol 9(8): e1001127), it is by Camilo Mora, Derek Tittensor, Sina Adl, Alastair Simpson and Boris Worm. It is accompanied by a commentary by none other than Robert May, one of the greatest ecologists of all time: PLoS Biology: Why Worry about How Many Species and Their Loss?

I note – I found out about this paper from Carl Zimmer who asked me if I had any comments.  Boy did I.  And Zimmer has a New York Times article today discussing the paper: How Many Species on Earth? It’s Tricky.  Here are my thoughts that I wrote down without seeing Carl’s article, which I will look at in a minute.

The new paper takes a novel approach to estimating the number of species. I would summarize it but May does a pretty good job:
“Mora et al. [4] offer an interesting new approach to estimating the total number of distinct eukaryotic species alive on earth today. They begin with an excellent survey of the wide variety of previous estimates, which give a range of different numbers in the broad interval 3 to 100 million species”
….
“Mora et al.’s imaginative new approach begins by looking at the hierarchy of taxonomic categories, from the details of species and genera, through orders and classes, to phyla and kingdoms. They documented the fact that for eukaryotes, the higher taxonomic categories are “much more completely described than lower levels”, which in retrospect is perhaps not surprising. They also show that, within well-known taxonomic groups, the relative numbers of species assigned to phylum, class, order, family, genus, and species follow consistent patterns. If one assumes these predictable patterns also hold for less well-studied groups, the more secure information about phyla and class can be used to estimate the total number of distinct species within a given group.”
The approach is novel and shows what appears to be some promise and robustness for certain multicellular eukaryotes. For example, analysis of animals shows a reasonable leveling off for many taxonomic levels:

Figure 1. Predicting the global number of species in Animalia from their higher taxonomy. (A–F) The temporal accumulation of taxa (black lines) and the frequency of the multimodel fits to all starting years selected (graded colors). The horizontal dashed lines indicate the consensus asymptotic number of taxa, and the horizontal grey area its consensus standard error. (G) Relationship between the consensus asymptotic number of higher taxa and the numerical hierarchy of each taxonomic rank. Black circles represent the consensus asymptotes, green circles the catalogued number of taxa, and the box at the species level indicates the 95% confidence interval around the predicted number of species (see Materials and Methods).
From Mora C, Tittensor DP, Adl S, Simpson AGB, Worm B (2011) How Many Species Are There on Earth and in the Ocean? PLoS Biol 9(8): e1001127. doi:10.1371/journal.pbio.1001127
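
The extrapolation idea behind panel G can be sketched in a few lines. This is a simplification and not the authors' procedure: it fits a single least-squares line to log counts against rank number, whereas the paper uses multimodel fits, and the counts below are invented so that the arithmetic is easy to follow.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_species(higher_taxa_counts):
    """Extrapolate a species count from counts at higher ranks.

    higher_taxa_counts: counts ordered phylum, class, order, family, genus
    (ranks 1..5). Species is treated as the next rank.
    """
    ranks = list(range(1, len(higher_taxa_counts) + 1))
    log_counts = [math.log(c) for c in higher_taxa_counts]
    slope, intercept = fit_line(ranks, log_counts)
    species_rank = len(higher_taxa_counts) + 1
    return math.exp(intercept + slope * species_rank)


# INVENTED counts: each rank has 4x the taxa of the rank above it.
toy_counts = [2, 8, 32, 128, 512]
estimate = predict_species(toy_counts)
print(round(estimate))
```

Because the toy counts grow by exactly a factor of four per rank, the fitted line simply continues that pattern one more step. The whole method rests on that pattern holding for poorly studied groups too.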

They also do a decent job of testing their use of higher taxon discovery to estimate the number of species.  Figure 2 shows this pretty well.

Figure 2. Validating the higher taxon approach. We compared the number of species estimated from the higher taxon approach implemented here to the known number of species in relatively well-studied taxonomic groups as derived from published sources [37]. We also used estimations from multimodel averaging from species accumulation curves for taxa with near-complete inventories. Vertical lines indicate the range of variation in the number of species from different sources. The dotted line indicates the 1∶1 ratio. Note that published species numbers (y-axis values) are mostly derived from expert approximations for well-known groups; hence there is a possibility that those estimates are subject to biases arising from synonyms.

So all seems hunky dory and pretty interesting.  That is, until we get to the bacteria and archaea.  For example, check out Table 2:

Table 2. Currently catalogued and predicted total number of species on Earth and in the ocean.

Their approach leads to an estimate of 455 ± 160 Archaea on Earth and 1 in the ocean.  Yes, one in the ocean.  Amazing.  Completely silly too.  Bacteria are a little better.  An estimate of 9,680 ± 3,470 on Earth and 1,320 ± 436 in the oceans.  Still completely silly.

Now the authors do admit to some challenges with bacteria and archaea. For example:

We also applied the approach to prokaryotes; unfortunately, the steady pace of description of taxa at all taxonomic ranks precluded the calculation of asymptotes for higher taxa (Figure S1). Thus, we used raw numbers of higher taxa (rather than asymptotic estimates) for prokaryotes, and as such our estimates represent only lower bounds on the diversity in this group. Our approach predicted a lower bound of ~10,100 species of prokaryotes, of which ~1,320 are marine. It is important to note that for prokaryotes, the species concept tolerates a much higher degree of genetic dissimilarity than in most eukaryotes [26],[27]; additionally, due to horizontal gene transfers among phylogenetic clades, species take longer to isolate in prokaryotes than in eukaryotes, and thus the former species are much older than the latter [26],[27]; as a result the number of described species of prokaryotes is small (only ~10,000 species are currently accepted).

But this is not remotely good enough from my point of view. Their estimates of ~10,000 or so bacteria and archaea on the planet are so completely out of touch in my opinion that this calls into question the validity of their method for bacteria and archaea at all.
Now you may ask – why do I think this is out of touch? Well, because reasonable estimates are more on the order of millions or hundreds of millions, not tens of thousands. To help people feel their way through the literature on this I have created a Mendeley group where I am posting some references worth checking out.

I think it is definitely worth looking at those papers.  But just for the record, some quotes might be useful.  For example, Dan Dykhuizen writes:

we estimate that there are about 20,000 common species and 500,000 rare species in a small quantity of soil or about a half million species.

And Curtis et al. write:

We are also able to speculate about diversity at a larger scale, thus the entire bacterial diversity of the sea may be unlikely to exceed 2 × 10^6, while a ton of soil could contain 4 × 10^6 different taxa.

Are their estimates perfect?  No, surely not.  But I think without a doubt the number of bacterial and archaeal species on the planet is in the range of millions upon millions upon millions.  10,000 is clearly not even close.  Sure, we do not all agree on what a bacterial or archaeal species is.  But with just about ANY definition I have heard, I think we would still count millions.

Given how horribly horribly off their estimates are for bacteria and archaea, I think it would have been better to be more explicit in admitting that their method probably simply does not work for such taxa right now.  Instead, they took the approach of saying this is a “lower bound”.  Sure.  That is one way of dealing with this.  But that is like saying “Dinosaurs lived at least 500 years ago” or “There are at least 10 people living in New York City” or “Hiking the Appalachian Trail will take at least two days.”  Lower bounds are only useful when they provide some new insight.  This lower bound did not provide any.
Mind you, I like the paper.  The parts on eukaryotes seem quite novel and useful.  But the parts of bacteria and archaea are painful.  Really really painful.
Mora, C., Tittensor, D., Adl, S., Simpson, A., & Worm, B. (2011). How Many Species Are There on Earth and in the Ocean? PLoS Biology, 9 (8) DOI: 10.1371/journal.pbio.1001127

Microbes do some strange things: splitting and permuting tRNAs

Figure 1 – Predicted secondary structures of trans-spliced and permuted precursor tRNAs
(a) Mature tRNAAsp(GUC) in A. pernix and T. aggregans are formed by joining the 5′ half and the 3′ half at position 37/38 after splicing at the bulge-helix-bulge (BHB) motif. (b) The 5′ half and the 3′ half of trans-spliced tRNALys(CUU) in S. hellenicus and S. marinus join at position 30/31, same as the previously identified split tRNALys(CUU) in N. equitans [5]. (c) Circularized permuted tRNAiMet(CAU) and tRNATyr(GUA) in T. pendens have the 3′ half located upstream of the 5′ half separated by intervening sequences represented in green. The two fragments join at position 59/60, same as the T-Ψ-C loop permuted tRNAs in the red alga C. merolae [9]. Pre-tRNAAla(UGC) in C. merolae is shown for comparison. 5′ halves of tRNA transcripts are represented in blue, the 3′ halves in orange. Black arrows indicate positions of splicing. Anticodons are boxed in light blue.

I was woefully unaware of some of the tRNA shenanigans going on in microbes until reading this paper in Genome Biology: “Discovery of permuted and recently split transfer RNAs in Archaea” from Patricia Chan, Aaron Cozen and Todd Lowe. Life is pretty weird and wacky sometimes, even in components of cells that are considered “core” parts of the machinery of life. Go figure. It is worth a read …

Archaea in the news – a growing trend

Archaea, the so-called “third” branch in the tree of life, don’t get in the news much, but it is good when they do, and for some reason they are getting in the news more and more these days.  See below for some links to news stories.

Most important paper ever in microbiology? Woese & Fox, 1977, discovery of archaea

Well, today in my “Microbial phylogenomics” class at UC Davis we are discussing what I think might be the most important paper (well, actually, series of papers) in the history of microbiology. The papers are the ones where Carl Woese, George Fox and colleagues outline the evidence for the existence of a “hidden” third major branch in the tree of life – what is now known as the archaea. The evidence for this third branch was first laid out in a series of papers in 1977 including:

Now Woese, Fox and others in Woese’s group had been leading up to these publications in some ways for years (I note, there were some pretty incredible people involved in these studies in the years before 1977 too, including Mitch Sogin, now at MBL, David Stahl, Chuck Kurland, Norm Pace, etc., but that is another story). They had been determining the nucleotide sequences of small fragments of rRNAs from different species, especially from different organisms that did not have nuclei – the so-called “prokaryotes”. And they were using these sequences to infer the phylogenetic relationships among these microbes.
Consider, for example, the paper by SJ Sogin et al. in 1972, “Phylogenetic measurement in procaryotes by primary structural characterization.” Sogin SJ, Sogin ML, Woese CR. J Mol Evol. 1971;1(1):173-84. This paper laid out some of the arguments for why rRNA sequence information might rewrite our concepts of classification of prokaryotes. From this and many of the other papers from Woese and Fox and others before 1977 it had been shown that one could use rRNA sequence information to more accurately infer relationships among “prokaryotes” than had been done previously with other types of information. Today this notion that we can use sequence information to infer the evolutionary history of microbes is taken for granted, but back in the early 1970s it was not. And in addition, many people probably just did not care too much about the exact details of microbial phylogeny and classification.
But this changed in 1977 with that series of papers outlined above. What these papers showed was that hidden beneath everyone’s noses was a separate, previously unknown, major split in the prokaryotes into two distinct lineages. One of these included all the standard bacteria people were familiar with like E. coli and B. subtilis and one of them included some pretty weird wacked out bugs that thrived in extreme conditions. For example, look at the phylogenetic tree from Fox et al.
This tree (made using a distance based clustering algorithm where the distances represent a measure of the similarity of the catalog of short oligonucleotides found in each species) shows the normal bacteria on one side (down below) and methanogens and their relatives on another side. I like the last line of the abstract, which to an evolutionary microbiologist can be considered equivalent to Watson and Crick’s “It has not escaped our notice …”. Here Fox et al. say “These organisms appear to be distantly related to typical bacteria.”
The Balch et al. paper has similarly interesting, cool nuggets. However, alas, it is not available in PubMed Central as the other two papers are, so I am not focusing on it here. What is great though is that the other two papers are freely available to anyone to read in PubMed Central and also at the PNAS web site. Yay for access. Too bad the other paper is not freely available.
Anyway, fortunately, the most critical of these papers is the Woese and Fox paper from PNAS, which is freely available. And it is in this paper that the full argument is laid out. Consider the abstract:

ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent: (i) the eubacteria, comprising all typical bacteria; (ii) the archaebacteria, containing methanogenic bacteria; and (iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.

In this paper they lay out the evidence for the existence of at least three main branches in the Tree of Life. Interestingly, for the phylogenetically minded people out there, they do not show an evolutionary tree in the paper. What they show is what is known as a similarity matrix (the inverse, in essence, of the distance matrices many people may be used to seeing), where a score is given for the similarity between organisms in the fingerprints of their 16S/18S rRNAs.
If one scans through the matrix one can clearly see three clusters of similarity scores.
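
For readers who want to see how clusters can fall out of a similarity matrix, here is a small sketch. The organism labels, the scores, and the 0.5 cutoff are all invented for illustration (they are not the values in the Woese and Fox paper), and the grouping rule, linking any two organisms whose similarity exceeds the cutoff, is just one simple choice and not necessarily what they did.

```python
def cluster_by_similarity(names, sim, cutoff):
    """Group names so that any pair scoring above cutoff shares a cluster.

    This is single-linkage grouping done with a tiny union-find.
    Pairs missing from sim are treated as not linked.
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in sim.items():
        if score > cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in clusters.values())


def to_distance(sim):
    """Turn similarity scores in [0, 1] into distances via 1 - score."""
    return {pair: 1.0 - score for pair, score in sim.items()}


# INVENTED scores: high within a group, low between groups.
names = ["bact_A", "bact_B", "euk_A", "euk_B", "meth_A", "meth_B"]
sim = {
    ("bact_A", "bact_B"): 0.7, ("euk_A", "euk_B"): 0.8,
    ("meth_A", "meth_B"): 0.6,
    ("bact_A", "euk_A"): 0.1, ("bact_A", "meth_A"): 0.1,
    ("euk_A", "meth_A"): 0.1, ("bact_B", "euk_B"): 0.1,
    ("bact_B", "meth_B"): 0.1, ("euk_B", "meth_B"): 0.1,
}
clusters = cluster_by_similarity(names, sim, cutoff=0.5)
print(clusters)
```

With these made-up scores the six labels fall into three pairs. The to_distance helper shows the "inverse" relationship mentioned above: a high similarity score becomes a short distance.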
From Table 1 of their paper, Woese and Fox infer the existence of three primary branches in the tree of life. This is laid out in a few paragraphs starting with one at the bottom of page 5088.

A comparative analysis of these data, summarized in Table 1, shows that the organisms clearly cluster into several primary kingdoms. The first of these contains all of the typical bacteria so far characterized …. (lots of names here) … It is appropriate to call this urkingdom the eubacteria.

And then a second paragraph discusses the second group
A second group is defined by the 18S rRNAs of the eukaryotic cytoplasm-animal, plant, fungal, and slime mold (unpublished data). … (They call this lineage the urkaryotes).
And then the third paragraph lays out the revolution:

Eubacteria and urkaryotes correspond approximately to the conventional categories “prokaryote” and “eukaryote” when they are used in a phylogenetic sense. However, they do not constitute a dichotomy; they do not collectively exhaust the class of living systems. There exists a third kingdom which, to date, is represented solely by the methanogenic bacteria, a relatively unknown class of anaerobes that possess a unique metabolism based on the reduction of carbon dioxide to methane (19-21). These “bacteria” appear to be no more related to typical bacteria than they are to eukaryotic cytoplasms. Although the two divisions of this kingdom appear as remote from one another as blue-green algae are from other eubacteria, they nevertheless correspond to the same biochemical phenotype. The apparent antiquity of the methanogenic phenotype plus the fact that it seems well suited to the type of environment presumed to exist on earth 3-4 billion years ago lead us tentatively to name this urkingdom the archaebacteria. Whether or not other biochemically distinct phenotypes exist in this kingdom is clearly an important question upon which may turn our concept of the nature and ancestry of the first prokaryotes.

Mind you, the whole paper is worth reading, but those three paragraphs lay out a revolution in how one thinks about the tree of life. Now admittedly, some of our notions of the tree of life have changed since 1977 and there is much more of a feeling of mixing and merging between branches than was appreciated back then. And some definitely feel that the archaebacteria (or archaea as they are known today) are not per se a third branch in the tree of life but rather that there are four or five major branches and that archaea may not in fact be a “monophyletic grouping”. But whether you think archaea truly represent a third branch in the tree of life or not, this paper fundamentally altered how we think about the tree and about microbes. The work was even written up in the New York Times and got a lot of press (not that that is proof of anything – but it got microbial phylogeny into the public’s mind).
I think it is worth having all biology students read and understand this paper. Which is why I now try to cover it in basically all classes whenever I can. I could go on and on, but I will simply end with their last paragraph:

With the identification and characterization of the urkingdoms we are for the first time beginning to see the overall phylogenetic structure of the living world. It is not structured in a bipartite way along the lines of the organizationally dissimilar prokaryote and eukaryote. Rather, it is (at least) tripartite, comprising (i) the typical bacteria, (ii) the line of descent manifested in eukaryotic cytoplasms, and (iii) a little explored grouping, represented so far only by methanogenic bacteria.

Citations
Woese CR, & Fox GE (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America, 74 (11), 5088-90 PMID: 270744

Fox GE, Magrum LJ, Balch WE, Wolfe RS, & Woese CR (1977). Classification of methanogenic bacteria by 16S ribosomal RNA characterization. Proceedings of the National Academy of Sciences of the United States of America, 74 (10), 4537-4541 PMID: 16592452

Balch WE, Magrum LJ, Fox GE, Wolfe RS, & Woese CR (1977). An ancient divergence among the bacteria. Journal of molecular evolution, 9 (4), 305-11 PMID: 408502



More coverage of the GEBA "Phylogeny Driven Genomic Encyclopedia"

Just a quick note here to post some links to additional stories about my new paper on “A phylogeny driven genomic encyclopedia of bacteria and archaea” which was published last week in Nature (with a Creative Commons license – which is rare in Nature but is what they use for genome sequencing papers).

Carl Zimmer has an article today in the New York Times “Scientists Start a Genomic Catalog of Earth’s Abundant Microbes”  about the paper and the project.  In the article he interviews me and Hans-Peter Klenk, who was a co-author and led the culturing part of the project.  A few things to note about this:

  • It is rare to have archaea mentioned in the New York Times.
  • There is a tree that goes along with the article which is a modified version of the tree we had in our paper.  I think theirs is very nice. Kudos to their artist.
  • There is a quote by Norm Pace generally supportive of the project 
  • The article mentions the JGI Adopt a Microbe program and even has a shout out to Malcolm Campbell at Davidson College and his recent PLoS One paper where they discuss results from a project where they took one of the genomes from our project and used it as part of a course on genome annotation/analysis. 

For some of the story behind the paper see my recent blog post “Story Behind the Nature Paper on ‘A phylogeny driven genomic encyclopedia of bacteria & archaea’ #genomics #evolution”.

Other discussions worth checking out

Also see


Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea Nature, 462 (7276), 1056-1060 DOI: 10.1038/nature08656

Bakke, P., Carney, N., DeLoache, W., Gearing, M., Ingvorsen, K., Lotz, M., McNair, J., Penumetcha, P., Simpson, S., Voss, L., Win, M., Heyer, L., & Campbell, A. (2009). Evaluation of Three Automated Genome Annotations for Halorhabdus utahensis PLoS ONE, 4 (7) DOI: 10.1371/journal.pone.0006291

PLoS One Beta is released – a new way to publish and discuss scientific papers

Well just got an email from Chris Surridge of PLoS One saying their Beta Site is open to the public. I am excited by this new journal and system and plan to submit many of our papers there. People should check it out for themselves and hopefully give comments to them to make the system better. Some detail from the email is given below.

The first paper there that caught my eye is a paper on polyploidy in halophilic Archaea. This paper, by Sebastian Breuert, Thorsten Allers, Gabi Spohn, and Jörg Soppa, suggests that polyploidy is more common in archaea than was previously appreciated.

—————————-
The email says:

Before your first visit, I want to let you know about the inherent challenges of this project and the philosophy that compels PLoS to confront them.

We want to speed up scientific progress and believe that scientific debate is as important as the investigation itself. PLoS ONE is a forum where research can be both shared and commented upon – we are launching it as a beta website so that the whole scientific community can help us develop the features.

What makes the site beta? Not the content, which features peer-reviewed research from hundreds of authors across a diverse range of scientific disciplines. It’s the additional tools and functionality surrounding these papers that will be continually refined and developed in response to user feedback.

It is this union of continually evolving user tools provided by the Topaz publishing platform and extensive content that will make PLoS ONE a success.

….

The first beta release of PLoS ONE features tools that allow users to annotate articles and participate in discussion threads. Our goal is to spark lively discussion online and we’d like to invite you to participate. Future updates will include user ratings for both papers and the comments made about them, personalized content alerts and much more.

We will be watching with interest to see how our new platform and software responds to high volumes of traffic and encourage you to give your feedback on your first experience via the site itself.