Story behind the paper: small RNAs in diatom (interview w/ Andrew Allen)

Here is another “Story behind the paper”.  This one focuses on the following paper: Norden-Krichmar, T.M., Allen, A.E., Gaasterland, T., Hildebrand, M.  (2011) Characterization of the small RNA transcriptome of the diatom, Thalassiosira pseudonana. PLoS ONE 6(8): e22870. doi:10.1371/journal.pone.0022870
I wrote some questions up for Andrew Allen, one of the authors.  I note I did this before my “new” system of inviting authors to write guest posts directly themselves.  Not sure which approach is better but guest posts are certainly easier for me so I will probably do that more.

1.     What is the history behind this work?  How did it start?  Why did you do it?
These studies on small RNA in diatoms are the result of collaboration between my group at the J. Craig Venter Institute (JCVI) and Mark Hildebrand’s group at Scripps Institution of Oceanography (SIO). Each lab group is interested in the ecology, evolution, and physiology of diatoms. More specifically, we would like to know more about how diatoms sense and respond to environmental signals. Therefore we are interested in mechanisms of transcriptional regulation in diatoms and other microalgae. An earlier study suggested that cytosine methylation is an important mechanism for repression of transcriptional activity of retrotransposons, and associated mobility, in diatoms. In response to stress, nitrogen stress especially, long terminal repeat retrotransposons (LTR-RTs) display decreased levels of cytosine methylation (hypomethylation) and elevated transcriptional activity.
Maumus, F., Allen, A.E., Mhiri, C., Hu, H., Jabbari, K., Vardi, A., Grandbastien, M.A., Bowler, C. (2009). Potential impact of stress activated retrotransposons on genome evolution in a marine diatom. BMC Genomics 10:624.
Classically, small RNAs are known to play a key role in triggering gene silencing by DNA methylation. Also, short interfering RNAs (siRNAs) have been found to play a role in silencing retrotransposons and other repeat elements.
Therefore we were interested to investigate the small RNA repertoire of diatoms. Our first experiments were based on 454 sequencing of libraries constructed from small RNA purified from the diatom Thalassiosira pseudonana. It was clear to us that, despite promising results, much deeper sequencing would be required for a meaningful characterization of the small RNA transcriptome. We used ABI SOLiD sequencing to further explore the diversity and expression of small RNAs in T. pseudonana. Although deep sequencing was ultimately necessary to obtain sufficient coverage and resolution for statistically sound analyses, the SOLiD and 454 data were remarkably congruent.
At the time these studies were being conducted, 2009, there were some specific challenges associated with analyses of the SOLiD small RNA data. Extraction of all types of small RNAs for a non-standard organism was not straightforward.
Initial processing of the SOLiD data using commercial products, such as ABI’s Small RNA Pipeline and CLCbio’s CLC NGS Cell reference assembly software, yielded an average of approximately 6% reads aligned to the T. pseudonana genome. For ABI’s Small RNA Pipeline, even when omitting the filtering step by known miRNAs from the Sanger miRBase, the software gave a higher priority to matching the adapter sequences rather than matching to the genome, in order to produce small RNAs in the miRNA size range. Similarly, because CLCbio’s CLC NGS Cell program was not able to align any sequence less than 27 nucleotides in length, and many small RNAs are in this size range, it also had to be abandoned in this study.
The methodology presented in this study provides the steps necessary to discover all types of small RNA genes in next generation sequence data, and to perform a comparative analysis of different libraries of sequence data. Briefly, an approach was necessary to extract the small RNA sequences from the constant 35 nucleotide colorspace format SOLiD data, convert the colorspace data to its basespace equivalent, and map the sequences to the reference genome.  The colorspace data, which is a numerical representation of the color produced during sequencing for each successive two-nucleotide pair, was first converted to its basespace equivalent using CLCbio’s tofasta software.  The basespace format sequences were then aligned to the T. pseudonana reference genome with BLAST, acting to simultaneously determine the alignment locations and trim the spurious adapter nucleotides from the ends of the small RNA sequences.  This method yielded a recovery rate of 22% of the reads aligned to the genome, which is two or three times more reads than the ABI SOLiD Small RNA pipeline and CLCbio’s NGS Cell program, thereby producing a large data set for further analysis.
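The colorspace-to-basespace conversion step described above can be illustrated with a minimal Python sketch. It assumes the standard SOLiD two-base encoding, in which a read starts with a known primer base and each following digit (0 to 3) encodes the transition between two adjacent bases. The function name `colorspace_to_basespace` is hypothetical, and this is not the CLCbio tofasta implementation used in the study.

```python
# Illustrative sketch only: decodes a SOLiD colorspace read under the
# standard two-base encoding. Not the tool used in the study.

BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of adjacent base codes


def colorspace_to_basespace(read: str) -> str:
    """Decode a colorspace read such as 'T0123' into nucleotides.

    The first character is the known primer base. It anchors the decoding
    but is not part of the sequenced molecule, so it is dropped.
    """
    if not read or read[0] not in BASES:
        raise ValueError("read must start with a primer base (A, C, G or T)")
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            # An ambiguous call (often written '.') cannot be decoded, and
            # every later base depends on it, so stop with an error.
            raise ValueError(f"cannot decode color call {color!r}")
        prev ^= int(color)  # apply the transition to get the next base
        decoded.append(BASES[prev])
    return "".join(decoded)


print(colorspace_to_basespace("T0123"))  # prints TGAT
```

Because each base is decoded from the previous one, a single miscalled color changes every base after it. That is one reason a trimming and alignment step after conversion, as described above, matters for short reads.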
2. What is next?

We would like to establish improved conceptual integration for the role of small RNAs in various aspects of diatom evolution, metabolism, and biochemistry. More highly resolved expression patterns of small RNAs in response to specific environmental conditions will be required to make associations between specific small RNA loci and specific cellular processes. It seems likely that copia type retrotransposons play a major role in diatom genome evolution through promoting genome rearrangements and modification of gene expression levels through displacement and insertion of various promoter binding sites. We would like to attain a better understanding of the role of small RNAs in mediating transposon occurrence and transcriptional and insertional activity.  For example, in relation to retrotransposons, is the role of small RNAs strictly relegated to defense and silencing or do small RNAs also play a role in fostering establishment of transposons that ultimately have a positive impact on fitness?
3. Any interesting stories about the project like fights among authors (OK, maybe not that) – but anything more on the personal side of things?
The lead author of the study, Trina Norden-Krichmar, a bioinformaticist, did a lot of the lab work for this project.  Diatom culturing, RNA purification, running gels, 454 small RNA library construction, PCR, TOPO cloning, Northern blots, etc. are somewhat unusual activities for most bioinformaticians.  Interestingly, prior to earning a PhD Trina was a computer programmer who enjoyed open ocean swimming at the La Jolla Cove.  As a result of this recreational activity she was motivated to go back to school for a PhD in Marine Biology. Trina also authored a paper on small RNAs in the marine invertebrate Ciona.
4. Can you send links to any other information of value, including authors’ web sites?
My JCVI
My Mendeley (which has all PDFs mentioned here)
Mark H.
Terry G.
Other papers of interest (e.g., some recent Nature paper by you)
Other recent studies of interest include a publication in Nature earlier this year, Evolution and metabolic significance of the urea cycle in photosynthetic diatoms.
Evolution of intracellular urea synthesis by the ornithine-urea cycle (OUC) is classically known to have facilitated a wide range of physiological innovations and life history adaptations in vertebrates. For example, urea synthesis enables rapid osmoregulation in elasmobranchs (sharks, skates, rays) and bony fish, and ammonia detoxification in amphibians and mammals, which was likely a prerequisite for life on land. Ruminants and some hibernating mammals recycle nitrogen between the liver and gut through urea.
Evolutionarily it was unusual and highly unexpected to find a gene encoding the OUC form of carbamoyl phosphate synthetase (CPS) in diatoms. CPS evolution is a fascinating story with many chapters of gene duplication and fusion. Origin of the ornithine-urea cycle can be traced to ancient duplication and subsequent neofunctionalization of ancestral eukaryotic carbamoyl phosphate synthase (CPS), CPSII. CPSII, renamed pgCPS in this study to reflect function and substrate (pyrimidine metabolism and glutamine), is an ancient eukaryotic enzyme that resulted from fusion of bacterial amidotransferase and synthetase subunits. Interestingly there is significant internal similarity within the synthetase domain, which is the result of ancient duplication of a kinase domain. It has long been held that pgCPS duplicated in early diverging metazoans to form ugCPS (urea cycle, glutamine), which is targeted to mitochondria. Subsequently, in vertebrates, unCPS (urea cycle, ammonium) appeared and provided the foundation for the modern vertebrate urea cycle. Therefore, discovery of unCPS in unicellular stramenopile and haptophyte algae was highly unexpected. Also, physiologically, in animals, the urea cycle is a catabolic pathway that ultimately serves to export fixed nitrogen (in the form of urea) from cells. It was somewhat puzzling and conceptually challenging to imagine a role for the urea cycle within the context of photosynthetic cells. In addition to either glutamine or ammonium, CPS utilizes inorganic carbon in the form of HCO3 and therefore represents a form of carbon fixation as well. In diatoms, it appears that the urea cycle is the basis for a distribution and repackaging hub for inorganic carbon and nitrogen and is particularly important for redistribution and turnover of cellular nitrogen following episodic pulses of nitrate, which occur during oceanic upwelling events.
Although chloroplast and bacterial derived transfer of genes to the diatom nuclear genome have been described, very little is known about the contribution of the secondary endosymbiotic host (exosymbiont) to diatom metabolism. Results of this study indicate that the secondary endosymbiotic host genome made important physiological and biochemical contributions to the diatom nuclear genome sufficient to significantly distinguish secondary endosymbiotic algae from plants and green algae.
Also, three studies have been published this year related to carbon metabolism and the carbon concentrating mechanism (CCM) of diatoms. The occurrence of efficient CCM(s) in diatoms has long been hypothesized as a result of the relatively high affinity of diatom cells for inorganic carbon compared to the much lower affinity of the enzyme RubisCO for CO2. In other words, in order to overcome RubisCO inefficiencies, such as slow turnover and a propensity to fix O2 (i.e., photorespiration), there has been strong evolutionary selection for cellular adaptations that enable elevated CO2 at the site of fixation by RubisCO. Also, over geological time, atmospheric concentrations of CO2 have decreased while O2 has increased, presumably strengthening selection for CCMs in productive modern microalgae.
A manuscript by Hopkinson et al. published in PNAS is based on mass spectrometric measurements of passive and active cellular inorganic carbon fluxes in wild type and chloroplast carbonic anhydrase (CA) overexpression cell lines of the diatom Phaeodactylum tricornutum. Carbonic anhydrases (or carbonate dehydratases) are metalloenzymes that catalyze the rapid interconversion of carbon dioxide and water to bicarbonate and protons. Model simulations of these fluxes suggest that, due to membrane permeability to CO2, only around one-third of the inorganic carbon transported from the cytoplasm into the chloroplast is fixed photosynthetically, and the rest is lost by CO2 diffusion back to the cytoplasm. Therefore, in order to achieve the CO2 concentration necessary to saturate carbon fixation, it is hypothesized that CO2 is most likely concentrated within the pyrenoid, a specialized non-membrane bound proteinaceous structure within the chloroplast that contains high levels of RubisCO.
Hopkinson, B.M., Dupont, C.L., Allen, A.E., Morel, F.M.M. (2011). Efficiency of the CO2-concentrating mechanisms of diatoms. Proceedings of the National Academy of Sciences of the United States of America 108(10):3830-7.
In a paper by Tachibana et al. published in Photosynthesis Research, nine and thirteen carbonic anhydrases (CAs) were identified and experimentally localized in the marine diatoms Phaeodactylum tricornutum and Thalassiosira pseudonana, respectively. Immunostaining experiments show that PtCA1, a β-CA, is localized to the central part of the pyrenoid in the chloroplast.  Other CAs are shown to be localized to the periplastidal compartment, chloroplast endoplasmic reticulum, and mitochondria in P. tricornutum and the stroma and periplasm of T. pseudonana.
Tachibana, M., Allen, A.E., Kikutani, S., Endo, Y., Bowler, C., Matsuda. (2011). Localization of putative carbonic anhydrases in two marine diatoms, Phaeodactylum tricornutum and Thalassiosira pseudonana. Photosynthesis Research. Advance Access published March 2 2011, doi:10.1007/s11120-011-9634-4
A paper published by Allen et al. in Molecular Biology and Evolution (open access) examines the functional diversification of fructose bisphosphate aldolase (FBA) genes in diatoms. Class I and class II FBAs are involved in Calvin-Benson cycle reactions and glycolysis. Patterns of FBA evolution have been useful for questions related to chloroplast acquisition and evolution in primary and secondary endosymbiotic algae. The universal occurrence of class II FBAs in chromalveolate (diatoms, dinoflagellates, haptophytes and cryptophytes) plastids has been interpreted as evidence for chromalveolate monophyly and a single origin for the secondary plastid of red algal descent. In this new paper, Allen et al. demonstrate that class I and class II FBAs are localized to the diatom pyrenoid. Class II pyrenoid localized FBA appears to be the result of a chromalveolate specific gene duplication event. The significance of FBA localization in diatom pyrenoids is not fully understood, but enzymatic activity and gene transcription appear significantly enhanced under periods of iron (Fe) limitation, when photosynthesis is somewhat down regulated. The authors suggest that pyrenoid localization of some Calvin cycle components might provide a regulatory link between CCM and Calvin cycle activity.
Allen, A.E., Moustafa, A., Montsant, A., Eckert, A., Kroth, P., Bowler, C. (2011). Evolution and functional diversification of fructose bisphosphate aldolase genes in photosynthetic marine diatoms. Molecular Biology and Evolution. Advance Access published September 8, 2011, doi:10.1093/molbev/msr223

Special Guest Post & Discussion Invitation from Matthew Hahn on Ortholog Conjecture Paper


I am very excited about today’s post.  It is the first in what I hope will be many – posts from authors of interesting papers describing the “Story behind the paper“.  I write extensive detailed posts about my papers and also have tried to interview others about their papers if they are relevant to this blog.  But Matthew Hahn approached me recently about the possibility of him writing up some details on his recent paper on the functions of orthologs vs. paralogs.  So I said “sure” and set up a guest account for him to write up his comments and details of the paper.  


For those of you who do not know, Matt is on the faculty at U. Indiana.  He was a post doc at UC Davis so I have a particular bias in favor of him.  But his recent paper has generated some controversy (I posted some links about it here).  So it is great to get some more detail from him.  In addition, I note, I am also using this approach to try and teach people how easy it is to write a blog post by getting them guest accounts on Blogger and letting them write up something with links, pictures, etc.  So hopefully we can get more scientists blogging too.


Anyway – without any further ado – here is Matt’s post:

———————————————————————–
Following Jonathan’s excellent example of how explaining the history of a project helps to illuminate how the process of science actually happens, I thought I’d start by giving a bit of history behind our study, and the paper that we recently published in PLoS Computational Biology (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002073). And then I’ll address the critics…
How this all got started
It all started a bit more than three years ago, in the summer of 2008. Pedja (as Predrag Radivojac is known to friends) was giving a talk to a group of us on protein function prediction that he also presented as a tutorial at the Automated Function Prediction SIG at ISMB 2008. Pedja and I had already collaborated on a small project involving the evolution of phosphorylation sites, but I really had no idea about his work on function prediction, and little idea in general about how function prediction was done. Reviewing different ways to accomplish transfer-by-similarity, he eventually got around to evolutionary (phylogenomic) approaches. Here is what I remember of this specific exchange during his talk:
Pedja: …and of course these methods only use orthologs for prediction, because orthologs have more similar functions than do paralogs.
me (from audience): Who says?
Pedja: Umm, you say. I mean, the evolutionary biologists say.
me: No, we don’t. I don’t know of any data that says any such thing.
Pedja: Whatever, Matt. We’ll talk about this later.
Well, we did talk about it later, and it turned out that although this claim is made in tons of papers, there is basically no data to support it. In the best cases a real example of one gene family will be cited, but there are very few of these. In the worst cases, the authors will just cite some random paper about gene duplicates (or Fitch’s original paper defining orthologs and paralogs). Of course I agree that patterns of sequence evolution might lead you to conclude this relationship was true, but there was no experimental data.


In fact, as we say in our paper, rarely did anyone recognize that this claim needed to be tested, or even that it was a claim that could be tested. At the time Eugene Koonin was the only person to say this: “The validity of the conjecture on functional equivalency of orthologs is crucial for reliable annotation of newly sequenced genomes and, more generally, for the progress of functional genomics. The huge majority of genes in the sequenced genomes will never be studied experimentally, so for most genomes transfer of functional information between orthologs is the only means of detailed functional characterization” (http://www.ncbi.nlm.nih.gov/pubmed/16285863). I really liked the way that Eugene had said this, and started to refer to the idea that orthologs were more functionally similar than paralogs as the “ortholog conjecture.” So to be clear: I completely made up this phrase, but used the most evocative word from the Koonin paper.
Luckily for Pedja and me we had just gotten a small internal grant to work on genome annotation and we had an incoming master’s student (Nathan Nehrt) who was willing to work on a project intending to test the ortholog conjecture.
Interlude: the crappy state of things in the study of the evolution of function
In order to test anything about how function evolves between orthologs and paralogs—or between any genes—one of course needs some kind of data on gene function in multiple species. And this turns out to be a big problem.
Because, as Koonin says in the earlier quote, the vast majority of experimental data comes from a very few species, and these species are not exactly closely related. Here is an approximate phylogeny of the major eukaryotic model organisms:
It’s obvious from this figure that if you need both 1) lots of functional data from two species, and 2) a pretty good idea of exactly what the homologous relationships are between the genes you’re studying, you’re going to have to study human and mouse.
This is actually a pretty bleak picture for people who study molecular evolution (as I do). While we have tons and tons of sequence data both within and between species, and a very good idea about how these sequences evolve, and fancy models with which to analyze these sequences…we know next to diddly-squat about general patterns relating these sequence differences to functional differences. There are lots of interesting things to be gleaned from studies of sequence evolution, but it really would be nice to know something about the relationship between sequence and function.
What we found
What exactly does the ortholog conjecture predict? In my mind, at least, it predicts something like this:
In this completely fictitious graph the relationship between protein function and sequence similarity is a declining one, only it declines faster for paralogs than it does for orthologs. Also, just possibly, gene duplicates start out with slightly diverged function the minute they appear. Anyway, those were our predictions.
But here is what we found (Figure 1 in Nehrt et al. 2011):

(Panel A uses the Biological Process ontology and panel B uses the Molecular Function ontology.)
There are really two different, equally surprising results here. First, there is no relationship between sequence divergence and functional divergence for orthologs (among 2,579 one-to-one orthologs between human and mouse). Absolutely none—it’s a straight horizontal line. Second, there is a relationship for paralogs (among 21,771 comparisons), exactly as we predicted there would be. So according to our results, paralogs actually have more conserved function than do orthologs. Our interpretation of the data was that the most important determinant of function was the organismal context in which a gene/protein found itself: given the same amount of sequence divergence, two proteins in the same organism would be more functionally similar. For orthologs, this means that the sequence divergence of our target gene was not the most important thing, but rather the sum total of divergence in all of the genes that contribute to its cellular context. Which is why all the orthologs have on average similar functional divergence—they are all exactly the same age and hence have approximately the same levels of divergence in these interactors (in this case sequence divergence for paralogs is a much better indicator of their splitting time).
Without going through every result in the paper and our interpretation of every result, suffice it to say that after about a year-and-a-half of working on this (around February 2010), we were satisfied that we had something we were willing to submit. I even seem to remember showing the above figure to Jonathan on a visit to UC-Davis! So we did submit the paper, first to PNAS and then, after rejection, to PLoS Computational Biology, where it was rejected again.
The content of the reviews was approximately the same at both journals. Basically, people were not convinced of our results mostly because the functional relationships were all based on data in the Gene Ontology database. To be specific, the functional data we used came from experiments conducted in 12,204 different papers. We didn’t use any predicted functions, only functions assigned using experimental data. And we did A LOT of work to try to eliminate problems that might have affected our results, including repeating the main analysis using only GO terms common to both the human and mouse datasets. But there can still be bias hidden within these functional assignments because someone always has to interpret the experiment—to say that a yeast two-hybrid experiment means that a gene has function X. And because of these biases, people weren’t buying it.
To get a measure of functional similarity that did not depend on the interpretation of any experiments, we decided to repeat the entire analysis using microarray data, using the correlation in expression levels across 25 tissues as the measure of functional similarity. By this time Nathan was graduating and moving on to Maricel Kann’s lab as a research programmer, so we recruited one of Pedja’s Ph.D. students, Wyatt Clark, to pick up where Nathan had left off. (Wyatt had actually been a student in my undergraduate Evolution course a few years earlier, so we figured he knew something…) After repeating all of the GO-based analyses himself—always better to double-check, right?—Wyatt got all of the microarray data in order and produced this figure (Figure 4 in Nehrt et al. 2011):
So a year after we first submitted a paper, we submitted a new version to PLoS CB including the array analysis, and this was enough to convince the reviewers that our results were not merely due to some strange bias in GO.
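The expression-based measure of functional similarity described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text says "correlation in expression levels" without naming the coefficient, so Pearson correlation is assumed here, and the function name and toy expression values are hypothetical (five made-up tissues rather than the 25 used in the paper).

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two expression profiles of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Toy profiles over five hypothetical tissues (assumed values, not real data).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]  # same tissue pattern, higher overall level
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]   # opposite tissue pattern

print(pearson_correlation(gene_a, gene_b))  # close to 1: similar expression pattern
print(pearson_correlation(gene_a, gene_c))  # close to -1: opposite pattern
```

A measure like this compares the shape of two profiles across tissues rather than their absolute levels, and it does not rely on anyone interpreting an experiment and assigning a GO term.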
The fallout, and some responses
First, let me say that I had some idea that this would be a controversial-ish paper, and that we’d get at least some blowback. For about the first 20 versions of the manuscript (including some submitted versions) I put the words “ortholog conjecture” in quotes in the title, never an endearing move. (Pedja finally convinced me to take them out of the latest submissions.) But I also thought people would be happier that an untested assumption had finally been tested, and we have definitely gotten some positive feedback along these lines, including several groups that told us they have data that support our findings. By coincidence my lab had another paper come out the same week as this one (http://www.ncbi.nlm.nih.gov/pubmed/21636278), and I mistakenly thought it would generate much more attention. I still think the biological importance of the results in that one is much greater than that of the ortholog conjecture results, but either because we didn’t publish in an open-access journal (Jonathan is always right) or simply because the function-prediction community is more active on the interweb tubes, there have been a surprising number of critical responses (partially collected here: http://phylogenomics.blogspot.com/2011/09/some-links-on-ortholog-conjecture-paper.html). So here are some responses to general critiques.
The ortholog conjecture says only that orthologs are similar.
Okay, this one is a bit unfair, as only one person has said this. The real problem here is that Michael Galperin seems to have deeply misunderstood what we mean by the ortholog conjecture. According to him the ortholog conjecture is “the assumption that orthologs (genes with a common origin that were vertically inherited from the same gene in the last common ancestor of the host organisms) typically retain the same function or have closely related ones.” Umm, no. In fact, if you really think this is what the ortholog conjecture says, then our results support it—human and mouse orthologs do typically have closely related functions. But we are explicitly testing for a difference between orthologs and paralogs, not whether or not orthologs retain any functions. At no point did we say (or even hint) that orthologs should not be used for functional prediction. The whole point of our analysis and conclusions is that we should stop ignoring paralogs, which would give us a ton more data to use for the prediction of functions.
The assignments of orthology and paralogy are incorrect.
This is an easy one: we did in fact get the definitions of in- and out-paralogs correct (and laid them out in Figure S1). According to Sonnhammer and Koonin: “Our definition of ‘outparalogs’ is: paralogs in the given lineage that evolved by gene duplications that happened before the radiation (speciation) event” (http://www.ncbi.nlm.nih.gov/pubmed/12446146). For the purposes of our study, this means that outparalogs are defined as any paralogs that diverged before the speciation event between human and mouse and inparalogs diverged after this speciation event. Outparalogs do not indicate only paralogs in two different species, though by necessity in our dataset inparalogs are only found in the same species (all in human or all in mouse). Therefore, with respect to our conclusion that the most important determinant of function is which genome you are found in (i.e. context), it wouldn’t matter if we had incorrect gene trees: we would never confuse two genes in the same species (either inparalogs or some of the outparalogs) with two genes in different species (all orthologs and the remaining outparalogs).
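The in-/out-paralog distinction used in this response can be written down as a tiny Python sketch. The function name and the age values are hypothetical and in arbitrary units; it only illustrates the definition quoted above and is not how the gene trees in the paper were analyzed.

```python
def classify_paralogs(duplication_age: float, speciation_age: float) -> str:
    """Classify a paralog pair relative to one speciation event.

    Ages are times before present, so a larger value means an older event.
    A duplication older than the speciation gives outparalogs; a duplication
    younger than the speciation gives inparalogs. An exact tie is treated as
    inparalog here, which is an arbitrary choice for this sketch.
    """
    if duplication_age > speciation_age:
        return "outparalog"
    return "inparalog"


# Hypothetical ages in arbitrary units, with the speciation event at 1.0.
SPECIATION_AGE = 1.0
print(classify_paralogs(2.5, SPECIATION_AGE))  # prints outparalog
print(classify_paralogs(0.3, SPECIATION_AGE))  # prints inparalog
```

Note that the label depends only on timing, not on which genome the two genes sit in. That matches the point above that outparalogs can be in the same species or in different species, while inparalogs are always in the same one.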
You should have inferred functions yourselves
This is a fair suggestion, and not having enough time to annotate functions for 40,000 proteins would be a pretty weak excuse for doing good science. Instead…I’ll just say that it turns out professional curators are much better at assigning functions than even the original study authors (see http://www.ncbi.nlm.nih.gov/pubmed/20829821). Curators have a much broader view of the whole set of terms available in any ontology, and a much more consistent idea of how to apply these terms. My favorite line from the above cited article: “…because of the relatively low accuracy of the authors’ submissions, the use of authors’ annotations did not result in saving of curators’ time…”
GO is not appropriate for this analysis because it is biased.
This is the most frustrating criticism of our study, if only because it’s partly true: GO is biased. In our paper we actually detail several of these biases, including the observation that the same set of authors will give two proteins more-similar functions than will two different sets of authors. We tried very hard to attempt to control for these biases, though of course one cannot account for all of them. The most uncharitable part of this critique, however, has to be the fact that people conveniently forgot to say that our array analysis was completely distinct from the GO-based analysis (though it has its own issues), and that Burkhard Rost’s analysis of protein-protein interaction (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020079) was also completely free of any bias in GO and was consistent with all of our results.
More annoying than this, you’d think from some of the critiques of GO that it was some sort of fly-by-night operation that no one should ever depend on. I mean, c’mon—there are human curators and human experimenters and of course they’re all biased so badly one could never compare functions between proteins much less between species. What were we thinking? (Only that the original GO paper has been cited >7000 times.) Funnily enough, at several points during the course of this work Pedja suggested—only half-jokingly—that we should just assume the ortholog conjecture was correct and write a paper about how GO must be wrong. Seriously, though: one would think from the excuses people came up with for the problems inherent in GO that we should simply stop using it to, you know, predict function in other species. And we were applying it to two relatively closely related mammals, one of which is explicitly a model for the other.
What next?
Our paper laid out several explicit hypotheses about the evolution of function that arose from our findings. Unfortunately, testing any of these hypotheses will require a ton more functional data, in more than one species. I know there are multiple groups working to collect these sorts of labor-intensive datasets, and Pedja and I are thinking about doing it ourselves (with collaborators, of course!). Massive datasets that reveal protein function will always be a lot harder to collect than sequence data, especially ones free from biases.
So let’s get to it…

—————————

Note – Toni Gabaldón was trying to post a detailed response but Blogger kept cutting him off with a character limit.  So I have posted his response below.

I appreciate the effort by Matthew Hahn in explaining the story behind his paper on the so-called “Ortholog conjecture” and in facing some of the criticism. This paper attracted my interest, as it did that of many others who work on or just use orthology. For instance, it was chosen by one of my postdocs for our “Journal Club” meeting. And it was discussed during our last “Quest for Orthologs” meeting in Cambridge. I think it is raising a necessary discussion and therefore I think it is a good paper. This does not mean that I fully agree with the interpretation and conclusions ;-). I hope to modestly contribute to this debate with the following post.

I think one of the reasons this paper has caused so much debate is that the conclusions seem to challenge common practice (inferring function from orthologs), and could be interpreted as a call to change the strategies of genome annotation. I think, however, that one should interpret these results carefully before starting to annotate based on paralogous proteins. As I will discuss below, one of the problems is that we need to agree on what the conjecture is in order to then agree on how to test it. I see three main points that can be a source of confusion: i) the issue of what is actually stated by this conjecture, ii) the issue of annotation, and iii) the issue of time.

i) What is the “ortholog conjecture”?
Or in other terms, when should we expect orthologs to be more likely to share function than paralogs? Always? Of course not. All of us would agree that two recently duplicated paralogs are likely to be more similar in function than two distant orthologs, so it is obvious that the conjecture is not simply “orthologs are more similar in function than paralogs”. In reality the expectation that orthologs are more likely to be similar in function than paralogs, at least as I interpret it, is directly related to the effect that duplication has on functional divergence. If gene duplication has some effect on functional divergence (even if not in 100% of the cases), then, all other things being equal (divergence time, history of speciation/duplication events, except for the duplication defining the orthologs), one would expect orthologs to be more likely to conserve function.

I think this complexity is not well considered (by many authors, in general). Hahn refers to the famous review of orthology by Koonin (2005) as the source for the term “ortholog conjecture”. However, in that paper this conjecture is always discussed within the context of genes across two particular species, whereas in Hahn’s paper it is extended to other contexts as well. Thus, the proper context in which to test this conjecture is only between orthologs and between-species paralogs. As we can see, the red and purple lines in figure 2 of Hahn’s paper do not show any clear difference.

Secondly, Koonin was very cautious in his paper, stating that he was referring to “equivalent functions” and not exactly the same “function”, correctly implying that the functional contexts would be different in the two different species. This brings me to the next point.

ii) annotation
If the expectation of functional conservation of orthologs refers to a given pair of species, then it makes no sense to test that expectation between paralogs within the same species and orthologs in different species. We were interested in this issue and it took us some effort to control for this “species” influence on the comparison. If you are interested, you can read our paper on divergence of expression profiles between orthologs and paralogs (http://www.ncbi.nlm.nih.gov/pubmed/21515902).

As Hahn finds, and as was anticipated by Koonin in that review, there is a huge influence of the “species context”, a big constraint on what fraction of the function is shared. Indeed I think it is the dominant signal in Hahn’s paper. Why is that? One possibility is that the functional context determines the function, I agree. However, we should not discard biases in how the different communities working on a model species define processes and function, and in the type of experiments that are usually done. For instance, experimental inference from KO mutants might be common in mouse, but I guess it is not in humans (!!). I think this may be having a big influence and might even be the dominant signal in Hahn’s paper.

Finally, function has many levels and I expect subfunctionalization to mostly affect the lower (i.e. more specific) levels. Biases may also exist in the level of annotation between species or between families of different size (contributing more or less to the orthologs/paralogs class).

Microarray data are less likely to be subject to biases (although some may exist); at least they should be expected to be free of “human interpretation biases”, and so Hahn and colleagues did well, in my opinion, to test that dataset. It is important to note that for microarrays, and for orthologs and between-species paralogs (which I think is the right frame for testing the conjecture), orthologs are more likely to share an expression context. This is compatible with what we found in the paper mentioned above, and compatible with the ortholog conjecture as stated by Koonin (across species).

iii) time
Finally, one aspect which I think is fundamental is the notion of “divergence time”. Since paralogs can emerge at different time-scales, they comprise a heterogeneous set of protein pairs. Most comparisons of orthologs and paralogs (Hahn’s as well) use sequence divergence as a proxy for time. However, this is only a poor estimate, especially when duplications (as here) are involved (we explored this issue in the past: http://www.ncbi.nlm.nih.gov/pubmed/21075746). This means that for a given divergence time paralogs may have larger sequence divergence than orthologs at the same divergence time, or the other way around (if gene conversion is playing a role). Is the conjecture based on sequence divergence or on divergence time? I think the initial sense of using orthology to annotate across species is based on the notion of comparing things at the same evolutionary distance. Thus basing our conclusions on divergence times might not be the proper way of doing it.

CONCLUSIONS AND PROPOSAL FOR RE-STATEMENT

To conclude, and with the intention of going beyond this particular paper, I would finish by saying that the key to the problem lies in how we interpret the so-called “ortholog conjecture”, or in what our expectations are about how function evolves. What I get from re-reading Eugene Koonin’s paper, and how I am using that “assumption” in my day-to-day work, is the following:

“Orthologs in two given species are more likely to share equivalent functions than paralogs between these two species”

Therefore the notion of “across the same pair of species” is important, and thus only part of the comparisons made by Hahn and colleagues could directly test this. Looking at the microarray and between-species comparison data, the conjecture may even hold true!!

I, however, do think that the conjecture as stated above is limited and does not capture the complexity of orthology relationships. Indeed we, and many other researchers, tune the confidence of orthology-based annotation based on whether the orthologs are one-to-one, one-to-many or many-to-many, or even on whether they are “super-orthologs” (with no duplication event in the lineages separating the two orthologs).

Since the underlying assumption of the ortholog conjecture is that duplication may (not necessarily always) promote functional shifts, many-to-many orthology relationships will tend to include orthologous pairs with different functions.

 Thus I would re-state the conjecture (or expectation) as follows:

 “In the absence of additional duplication events in the lineages separating them, two orthologous genes from two given species are more likely to share equivalent functions than two paralogs between these two species”

 This would be a more conservative expectation, which is closer to the current use of orthology-based annotation that tends to identify one-to-one orthologs, rather than any type.

When duplications start appearing in subsequent lineages, thus creating one-to-many or many-to-many orthology relationships, the situation is less clear. Following the assumption that duplications may promote functional divergence, one could expand the conjecture with “the more duplications in the evolutionary history separating two genes, the lower the expectation that these two genes would share equivalent functions”.

I wrote this contribution on the fly, and surely there are ways of expressing this in more appropriate terms. In any case I hope I made clear the idea that the conjecture emerges from the notion of duplications causing functional shifts and that our expectations will be clearer if expressed in those terms. This goes along the lines of what Jonathan Eisen mentioned about considering the whole phylogenetic history when annotating genes.

From this perspective, the really important hypothesis is that “duplications tend to promote functional shifts”. I think this is based on solid grounds and has been tested intensively in the past.

 Cheers,

Toni Gabaldón

http://treevolution.blogspot.com

More on ‘phylogenomics’ – as in functional prediction w/ phylogeny

There is a new paper out: Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium in Briefings in Bioinformatics.

The paper is interesting and presents a new general approach to using phylogeny for functional prediction of uncharacterized genes. I am interested in this for many reasons, including that I was one of the first, if not the first, to lay this out as a concept.  In a series of papers from 1995-1998 I outlined how phylogenetic analysis could be used to aid in functional prediction for all the genes that were starting to be sequenced in genome projects without any associated functional studies (at the time, I referred to all these ESTs and other sequences as an “onslaught” – little did I know what was to come).

My first paper on this topic was in 1995: Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions.  The abstract is below:

The SNF2 family of proteins includes representatives from a variety of species with roles in cellular processes such as transcriptional regulation (e.g. MOT1, SNF2 and BRM), maintenance of chromosome stability during mitosis (e.g. lodestar) and various aspects of processing of DNA damage, including nucleotide excision repair (e.g. RAD16 and ERCC6), recombinational pathways (e.g. RAD54) and post-replication daughter strand gap repair (e.g. RAD5). This family also includes many proteins with no known function. To better characterize this family of proteins we have used molecular phylogenetic techniques to infer evolutionary relationships among the family members. We have divided the SNF2 family into multiple subfamilies, each of which represents what we propose to be a functionally and evolutionarily distinct group. We have then used the subfamily structure to predict the functions of some of the uncharacterized proteins in the SNF2 family. We discuss possible implications of this evolutionary analysis on the general properties and evolution of the SNF2 family.



I note – I am annoyed that when I went to the Nucleic Acids Research site for my paper I discovered that, for some bizarre reason, they are now trying to charge for access to it even though it is in Pubmed Central and used to be freely available on the NAR site.  WTF?  Is this just an IT issue like the #OpenGate complaints I made for a while about Nature Genome papers?

Anyway – in that paper in 1995 I basically showed that at least for this family, phylogenetic analysis could be used as a tool in making functional predictions by allowing one to better identify orthology relationships and subfamilies within the SNF2 superfamily.  This was novel I think maybe a little bit but others at the time were also looking into using various analyses to identify orthology relationships across genomes.

Shortly thereafter I started working on the concept that one could use the phylogenetic tree more explicitly in making functional predictions and eventually I laid out the concept of treating function as a character state and doing character state reconstruction using a gene tree to then infer functions for uncharacterized genes.  I called this approach “phylogenomics” in a paper in 1997 in Nature Medicine (the editor asked us to give it a name … and thus my own contribution to the omics word game began).  Alas somehow the title of our paper became “Gastrogenomic delights: a movable feast” since we were writing about the E. coli and H. pylori genomes, so I added yet another omics term at the same time.  In the paper I showed how phylogenetic analysis of the MutS family of proteins could help in interpreting one of the findings in the H. pylori genome paper:

In this paper we showed why blast searches were not ideal for inferring relationships among sequences (because blast measures similarity NOT evolutionary history per se).  A bit annoyed still that other papers then sort of claimed they were the first to show blast was not ideal for inferring evolutionary relatedness, but whatever. This still did not fully describe the phylogeny driven approach that I was working on so I then wrote up an outline of this approach for a paper in Genome Research: Phylogenomics: Improving Functional Prediction for Uncharacterized Genes by Evolutionary Analysis.  This paper really laid out the idea in more detail:

It also gave detailed examples of how similarity searches could be misleading and how phylogenetic analysis should in principle be better.
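
To make the "function as a character state" idea a bit more concrete, here is a minimal Python sketch. It uses Fitch parsimony on a small binary gene tree, which is just one simple way to do character state reconstruction and is my choice for illustration, not necessarily what any of the papers above used. The gene names, the tree, the function labels, and the `fitch_predict` helper are all hypothetical, and the sketch ignores things like duplication events and branch lengths.

```python
def fitch_predict(tree, known):
    """Predict functions for unannotated leaves of a binary gene tree (Fitch parsimony).

    tree  : nested 2-tuples with leaf names as strings (hypothetical format)
    known : dict mapping some leaf names to a function label (must not be empty)
    """
    all_states = frozenset(known.values())

    def down(node):
        # Bottom-up pass: build (leaf_name_or_None, candidate_states, children).
        if isinstance(node, str):
            # Unannotated leaves are compatible with every observed function.
            states = frozenset([known[node]]) if node in known else all_states
            return (node, states, ())
        left, right = down(node[0]), down(node[1])
        common = left[1] & right[1]
        return (None, common if common else left[1] | right[1], (left, right))

    predictions = {}

    def up(annotated, parent_state):
        # Top-down pass: keep the parent's state when allowed, else pick one
        # deterministically (alphabetically first) among the candidates.
        name, states, children = annotated
        state = parent_state if parent_state in states else min(states)
        if name is not None and name not in known:
            predictions[name] = state
        for child in children:
            up(child, state)

    up(down(tree), None)
    return predictions


# Toy example: geneX has no experimental data but sits inside a subfamily
# whose characterized members all share one (made-up) function label.
toy_tree = ((("geneA", "geneX"), "geneB"), ("geneC", "geneD"))
toy_known = {
    "geneA": "mismatch repair",
    "geneB": "mismatch repair",
    "geneC": "recombination",
    "geneD": "recombination",
}
print(fitch_predict(toy_tree, toy_known))
```

In this toy case the uncharacterized gene inherits the function of the subfamily it falls in, which is the basic logic of the approach; a top BLAST hit alone would not necessarily tell you which subfamily that is.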

I note – I am very very proud of this paper.  But it did not do a lot of things.  Really it was about laying out a concept of using tools from phylogenetics in functional prediction.  But it did not provide software for example.  I later developed some of my own scripts for doing this when I was at TIGR but really the software for phylogeny driven functional predictions would come later from others like Kimmen Sjolander, Sean Eddy, and Steven Brenner.  Each method laid out in these tools and in other papers had its own flavors and I continued to explore various approaches and applications to phylogeny driven functional prediction.  Examples of my subsequent work are listed below (with links to the Mendeley pages for these papers):

Plus we (at TIGR) used phylogenetic analysis as a tool in annotation of many many genomes as well as metagenomes.

Anyway, enough of history for a bit.  What is interesting about this new paper is that they take a slightly different approach to phylogeny driven functional prediction in that they make use of Gene Ontology functional annotations as their key parameter to trace on evolutionary trees.  They lay out the differences in their method quite well in the introduction:

Our general approach is similar to the ‘phylogenomic’ method proposed by Eisen [6] and further developed into a probabilistic form by Engelhardt et al. [7], but with important differences. Eisen proposed a conceptual approach for predicting protein function using a phylogenetic tree together with available experimental knowledge of proteins. The original approach relied on manual curation to identify gene duplication events and to find and assimilate the literature for characterized members of the family. Engelhardt et al. used automated reconciliation with the species tree [8] to identify gene duplication events, and experimental GO terms (MF only) to capture the experimental literature. Using this information, they defined a probabilistic model of evolution of MF involving transitions between different molecular functions.

From these previous studies, we adopt the basic approach of function evolution through a phylogenetic tree and the use of GO annotations to represent function. However, unlike these other phylogenomic methods, we represent the evolution in terms of discrete gain and loss events. In Eisen’s original model, an annotation does not necessarily represent a gain of function (it could have been inherited from an earlier ancestor), and losses are not explicitly annotated. The transition-based model of Engelhardt et al. assumes replacement of one function by another (gain of one function coupled to the loss of another), and does not capture uncoupled events, which is particularly important for BP annotations and cases where a protein has multiple molecular functions (see examples below). In addition, we make no a priori assumptions about conservation of function within versus between orthologous groups, or about the relationship between evolutionary distance and functional conservation (as the distance may not necessarily reflect every given function). While, as described below, gene duplication events and relatively long tree branches are important clues for curators to locate functional divergence (gain and/or loss), in our paradigm an ancestral function can be inherited by both descendants following a duplication (resulting in paralogs with the same function) or gained/lost by one descendant following a speciation event (resulting in orthologs with different functions). Evolution of each function is evaluated on a case-by-case basis, using many different sources of information about a given protein family

I note – Paul Thomas, one of the authors here has also been developing phylogeny driven functional prediction methods for many years and has done some cool things previously.  This new approach seems novel and useful and their paper is worth looking at.  I like too that they focus on MutS homologs for some of their examples:

Anyway – their paper is worth a read and some of their software tools may be of use including PAINT: http://sourceforge.net/projects/pantherdb/ and http://pantree.org

Good to see continuous developments in phylogeny driven functional predictions.  If you want to learn more – check out the Mendeley Group I have created:

http://www.mendeley.com/groups/1190191/_/widget/29/5/

And please contribute to it. Below are some previous posts of mine of possible interest:

I think that I shall never see – metagenomic analysis as lovely as a tree #PhylogenyRules #PLoSOne


Figure 2. Phylogenetic tree linking metagenomic sequences from 31 gene families along an oceanic depth gradient at the HOT ALOHA site

I am a co-author on a new paper that came out in PLoS One yesterday.  The paper is PLoS ONE: The Phylogenetic Diversity of Metagenomes and the full citation is Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214.

The first author is Steven Kembel, a brilliant post doc at the University of Oregon.  You can follow him on twitter here. This paper is a product of the “iSEEM” “integrating statistical, ecological and evolutionary approaches to metagenomics” collaboration between my lab and the labs of Jessica Green at U. Oregon and Katie Pollard at UCSF.  For more on iSEEM see http://iseem.org.  iSEEM was supported by the Gordon and Betty Moore Foundation.

Anyway – the paper focuses on developing and using a new method for assessing the phylogenetic diversity of microbes in samples via analysis of metagenomic data.  Phylogenetic diversity (aka PD) is measured by building evolutionary trees and summing up the total length of branches in such trees.  It is an important diversity metric and is complementary to metrics such as “species richness” which is a measure of the number of species in a sample. When one counts species in a sample, one ends up ignoring the evolutionary distances between species and thus one may get an incomplete picture of the diversity of organisms in a sample simply by counting species.  For example, a sample that contains 500 different species in the genus Escherichia would have the same “richness” as a sample that contained one representative of each of 500 different Orders of bacteria.  For many purposes it is useful to know whether one has a phylogenetically diverse sample or not.  (And of course, if one just focuses on species richness it is also important to not simply ignore some set of organisms in the samples as has sort of been done in a recent paper estimating the total species richness on the planet).  But that is not the point here – the point here is that counting species, even if done correctly, can give an incomplete picture of the diversity of organisms in a sample.
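
To make the richness vs. PD contrast concrete, here is a minimal Python sketch. The toy tree, its branch lengths, and the function names are invented for illustration and are not from the paper. PD is computed here in its simplest form, as the total length of all branches on the paths from the root to the sampled tips, with each branch counted once.

```python
# Toy rooted tree stored as child -> (parent, branch_length).
# Every name and branch length here is made up for illustration.
TREE = {
    "cladeA": ("root", 1.0),
    "cladeB": ("root", 1.0),
    "cladeC": ("root", 1.0),
    "a1": ("cladeA", 0.25),
    "a2": ("cladeA", 0.25),
    "a3": ("cladeA", 0.25),
    "b1": ("cladeB", 0.25),
    "c1": ("cladeC", 0.25),
}


def richness(sample):
    """Number of distinct taxa in the sample."""
    return len(set(sample))


def phylogenetic_diversity(sample, tree=TREE):
    """Sum of branch lengths connecting the sampled tips to the root."""
    visited = set()
    total = 0.0
    for tip in set(sample):
        node = tip
        # Walk towards the root, stopping at branches already counted.
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


close_sample = ["a1", "a2", "a3"]    # three close relatives
distant_sample = ["a1", "b1", "c1"]  # one tip from each deep clade

for label, sample in [("close", close_sample), ("distant", distant_sample)]:
    print(label, richness(sample), phylogenetic_diversity(sample))
```

Both toy samples have a richness of 3, but the sample spread across the three deep clades covers much more of the tree and so has the higher PD.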

For many years researchers have been attempting to measure phylogenetic diversity of various organisms in various samples.  And to do this one needs an evolutionary tree of the organisms in order to then measure branch length in the tree.  There is actually a relatively rich history of researchers attempting to look at PD in studies of microbes – especially in cases where one has access to a rRNA tree for the organisms / samples in question.  Examples of past work on this include:

What we wanted to do here was use metagenomic data to assess phylogenetic diversity of samples.  And in particular we wanted to do this with genes other than rRNA genes (e.g., protein coding genes).  There were multiple challenges in being able to do this (e.g., see a blog post I made about this issue a few years ago asking for community input).  Fortunately, Kembel has worked previously on multiple issues relating to phylogenetic diversity and phylogenetic ecology and his work led to this paper.

I note, as an aside, I have created a Mendeley group focusing on phylogenetic analysis of metagenomes and have added a diversity of papers to the collection:

http://www.mendeley.com/groups/1152921/_/widget/29/2/

In the paper Steve basically started with some of the notions and the code from AMPHORA which was designed by Martin Wu (when he was in my lab).  AMPHORA automatically infers phylogenetic trees of a set of 31 protein coding genes – and it can do this from genomic or metagenomic data. 
AMPHORA was designed to build phylogenetic trees of metagenomic sequences individually – in order to classify reads from samples to infer from what organism they likely came.
But that is not what Steven wanted to do here.  What he wanted to do was infer phylogenetic trees from metagenomic samples where ALL the organisms in the sample were included in the same tree.  This was / is challenging for many reasons and this is what I had written the blog post about previously.  One issue we had was the fact that sequences might not overlap with each other and thus including them in a single phylogenetic tree together was complicated.  
From my earlier post:
The challenge with this is really two things. First, we want to analyze just the reads themselves (i.e., we do not want to use assemblies you can make from this type of data). Second, and more importantly, we want to include in our analysis sequence reads that only cover small, not necessarily overlapping regions of the “full length” sequence alignments for the family. 

The alignment would look something like

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX
    where Xs are the regions covered by the sequences/fragments (could be DNA or amino acids)


We want to build trees from these alignments with the hope of using them to learn lots of cool things about the evolution of the fragments and the species from which they come. I can provide more information but really the key part for the phylogenetics here is the nature of the alignment.

In the past, I have decided to constrain my analyses to NOT deal with this type of alignment. I have either analyzed each fragment on its own or we have built a multiple alignment but only included fragments that cover more than 3/4 of the full length sequence and thus the matrix is much more filled out. Such an alignment would look like this:

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX-------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 --XXXXXXXXXXXXXXXXXXXXXXXX--------
    fragment 3 -----XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 --XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-
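
The filtering rule described above (keep only fragments that cover more than 3/4 of the full length alignment) can be sketched in a few lines of Python. The gap character, the toy rows, and the function names are assumptions for illustration only; this is not the code that was used for those analyses.

```python
GAP = "-"  # assumed gap character


def coverage(row):
    """Fraction of alignment columns where this row has a residue, not a gap.

    Assumes the row is non-empty.
    """
    return sum(1 for ch in row if ch != GAP) / len(row)


def filter_by_coverage(alignment, min_coverage=0.75):
    """Keep only rows whose coverage is strictly greater than min_coverage."""
    return {
        name: row
        for name, row in alignment.items()
        if coverage(row) > min_coverage
    }


# Toy alignment, 20 columns wide.
toy_alignment = {
    "sequence 1": "X" * 20,                       # full length
    "fragment 1": "X" * 16 + GAP * 4,             # covers 16 of 20 columns
    "fragment 2": GAP * 5 + "X" * 10 + GAP * 5,   # covers 10 of 20 columns
}

kept = filter_by_coverage(toy_alignment)
print(sorted(kept))
```

With the toy data, the full length sequence and the long fragment are kept and the short fragment is dropped, which is exactly the information we would rather not throw away.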

But we really want to include the smaller fragments in our analysis. And we are just not certain how to best do this. We know LOTs of people out there think of similar problems in terms of sparse matrices, supermatrices, supertrees, EST data, etc. And we have ideas about how to do this and are asking around by email some phylogenetics gurus we know. But I thought it might be fun to have the discussion on a blog rather than by email.

So again, how might one best build phylogenetic trees from data that looks like this?

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX


And from these trees we want to place each fragment relative to (1) the full length sequences and (2) to each other if possible. We also, of course, want branch lengths to reflect some sort of amount of evolution and thus do not just want a cladogram.

So what Steven decided to do in the end was create a method that took all of the AMPHORA markers and concatenated them together into a single mega alignment and then built a reference tree of this mega alignment from available genomes.  Then he searched for matches to any of these genes in metagenomic data and built a tree for each sequence that placed it relative to the reference data.  
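
As a rough illustration of the concatenation step, the Python sketch below joins per-gene alignments into one "mega alignment", padding with gaps when a genome lacks a gene. The gene names, genome names, toy sequences, and the `concatenate_alignments` helper are hypothetical; they are not taken from AMPHORA or from the pipeline used in the paper.

```python
GAP = "-"  # assumed gap character


def concatenate_alignments(gene_alignments, gene_order):
    """Join per-gene alignments into one row per genome.

    gene_alignments : dict gene -> dict genome -> aligned sequence
    gene_order      : list of genes giving the order of concatenation
    Genomes missing a gene get a run of gaps of that gene's alignment width.
    """
    genomes = sorted({g for aln in gene_alignments.values() for g in aln})
    concatenated = {g: "" for g in genomes}
    for gene in gene_order:
        aln = gene_alignments[gene]
        width = len(next(iter(aln.values())))
        if any(len(row) != width for row in aln.values()):
            raise ValueError(f"rows of {gene} differ in length")
        for g in genomes:
            concatenated[g] += aln.get(g, GAP * width)
    return concatenated


# Two made-up marker genes across three made-up genomes.
toy_genes = {
    "geneA": {"genome1": "MKV", "genome2": "MRV"},
    "geneB": {"genome1": "LLSA", "genome3": "LISA"},
}

mega = concatenate_alignments(toy_genes, ["geneA", "geneB"])
for genome, row in mega.items():
    print(genome, row)
```

A metagenomic read that matches only one of the genes ends up looking like a genome that is missing all the others: mostly gaps, with residues in one small region, which is the sparse alignment problem from the earlier post.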
Figure 1. Conceptual overview of approach to infer phylogenetic relationships among sequences from metagenomic data sets.
This pipeline allowed him to place many sequences from metagenomic samples onto a single tree such as this one:

Phylogenetic tree linking metagenomic sequences from 31 gene families along an oceanic depth gradient at the HOT ALOHA site

And from that he could calculate PD for metagenomic samples.  We then used the PD calculations to compare and contrast PD with other information, in particular from the HOT ALOHA metagenomic data set of Ed Delong, Steve Karl and others.

Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.

For more detail on what we did from there on – read the paper.  It is open access so all can see it / download it / play with it / whatever.  But rather than blather on and on as usual I thought I would email Steve some questions and then post his answers.  These are below:

1. Can you provide any background to how this work got started and why you ended up doing it?

This work got started as a collaboration between the Eisen, Green, and Pollard labs as part of the iSEEM project (“Integrating Statistical Evolutionary & Ecological Approaches to Metagenomics”), which was funded by the Moore Foundation to figure out ways to address ecological and evolutionary questions using metagenomic data. I had a background in using phylogenetic and evolutionary information to understand ecological communities, and one of the things I wanted to do at iSEEM was to try to think about ways that we could apply methods from ecophylogenetics or phylogenetic community ecology to metagenomic data sets. In conversations among the co-authors, we realized that if we could build phylogenetic hypotheses for organisms based on metagenomic data, we could apply a huge body of ecological and evolutionary theory and use these data sets to improve our understanding of microbial communities and their dynamics.

2. How did you end up working on microbes with your background in larger organisms?

The transition from working on macro-organisms to working on microbes actually wasn’t that big of a leap, since my research has generally been question driven rather than study-system or study-organism driven. My previous research involved using phylogenetic information to better understand community assembly in plants and animals. The increasing availability of phylogenetic information for entire communities of plants and animals drove the development of the field of ‘ecophylogenetics’, and it always seemed to me that microbes would be the ideal system for this type of approach due to the greater availability of sequence data and phylogenetic information for microbes. Also, the development of high-throughput sequencing methods meant that the size of microbial community data sets would quickly become really, really large… the prospect of working on data sets with hundreds of millions of observations was really exciting. As my first postdoc was wrapping up, I collaborated on a study looking at phylogenetic diversity of the rhizobacterial symbionts of plant roots that got me interested in microbial ecology. Right around that time I came across the opportunity to work on the iSEEM project, so it seemed like the perfect opportunity to try a new study system.

Having studied the community ecology of both micro- and macro-organisms, I find it interesting that the fields of microbial and non-microbial phylogenetic community ecology have been fairly insulated from one another until recently. For example, the two fields independently developed phylogenetic approaches to community ecology, each field having its own set of favored statistical methods and software packages, with almost no cross-citation, despite addressing very similar questions. In microbiology the emphasis on phylogenetic diversity measures seems to have been driven by the empirical difficulty of defining microbial ‘species’ and other taxonomic units that macro-organismal ecologists are comfortable with, as well as the availability of phylogenetic and sequence data for microbes. Conversely, for macroorganisms the field of ecophylogenetics was driven by a desire to apply a large body of theory on the links between ecological and evolutionary dynamics to empirical data sets, but was relatively data poor in terms of phylogenetic information about individual species.

3. What was the biggest challenge in this work?

For me the biggest challenge was convincing myself and others that we could infer anything about organismal phylogenies from metagenomic data.  People had built phylogenies for individual genes from metagenomic data sets, but there was a lot of skepticism about how and whether it would be possible to infer a phylogeny for multiple genes given the short, non-overlapping nature of metagenomic sequences. A post on your blog provided a lot of useful feedback. In the end this challenge was overcome both through the availability of software packages for placement of short sequences onto reference phylogenies, as well as simulation and bootstrap analyses to make sure that the results we were finding were robust.

4. Any additional things left out of the paper that you would like to mention here? Other acknowledgements?  Annoyances?

There were a number of people involved in the iSEEM project, including Samantha Riesenfeld and Aaron Darling, who did simulations that were very helpful in figuring out when and whether we could make inferences about phylogenetic relationships among metagenomic reads.

Our paper makes use of a large number of open-source software packages and I’d like to thank the people who made their code available for re-use in this way. In particular the short sequence placement methods implemented in packages like RAxML and pplacer made this study possible.

5. What (in general) are your current and future plans?

Right now I’m working at the Biology & the Built Environment Center on a number of projects studying the phylogenetic and functional diversity of microbes in indoor environments, trying to understand the interaction between architectural design and microbial diversity indoors, and the role indoor microbes play in human health and well being. I am still interested in plant biology, and I have an ongoing project looking at the diversity and function of microbial communities on plant leaves (the ‘phyllosphere’) in tropical and temperate forests.

Kembel, S., Eisen, J., Pollard, K., & Green, J. (2011). The Phylogenetic Diversity of Metagenomes PLoS ONE, 6 (8) DOI: 10.1371/journal.pone.0023214

Get to know Jack & the story behind the paper by @gilbertjacka "Defining seasonal marine microbial community dynamics"

A few days ago I became aware of the publication of a cool new paper: “Defining seasonal marine microbial community dynamics” by Jack A. Gilbert, Joshua A Steele, J Gregory Caporaso, Lars Steinbrück, Jens Reeder, Ben Temperton, Susan Huse, Alice C McHardy, Rob Knight, Ian Joint, Paul Somerfield, Jed A Fuhrman and Dawn Field.  The paper was published in the ISME Journal and is freely available using the ISME Open option. If you want to know more about Jack (in case you don’t know Jack, or don’t know jack about Jack) check out some of his material on the web, like his Google Scholar page, his Twitter feed, his LinkedIn page, and his U. Chicago page. But rather than tell you about Jack or the paper, I thought I would send some questions to the first author, Jack Gilbert, and see if I could get some of the “story behind the paper” out of him.  Since Jack likes to talk (and email and do things on the web), I figured it was highly likely I could get some good answers.  And indeed I was right. Here are his answers to my quickly written up questions (I have been out of the office due to family illness).


1. Can you provide some detail about the history of the project … How did it start ? What were the original plans ? (not this much sequencing I am sure)

The Western English Channel has been studied for over 100 years, and is in fact the longest studied marine site in the world. It is essentially the home of the Marine Biological Association, and has a long history. The idea to start contextualizing the abundant metadata (www.westernchannelobservatory.org) was started in 2003 by Ian Joint, a senior researcher at Plymouth Marine Laboratory (www.pml.ac.uk), who saw the benefit of collecting microbial life on filters and storing these at -80C. It was his vision to create and maintain this collection that enabled us to go back through this frozen time series and explore microbial life. I started working for PML in 2005, and basically was charged with trying to identify a potential technique to characterize the microbial life in these samples. Initially we got funding through the International Census of Marine Life to perform 16S rDNA V6 pyrosequencing on 12 samples. We chose 2007 as the first year, almost arbitrarily, and published that work in Environmental Microbiology in 2009 (http://onlinelibrary.wiley.com/doi/10.1111/j.1462-2920.2009.02017.x/abstract). However, we had already decided to go ahead, and with help from Dawn Field (Center for Ecology and Hydrology, UK) we were able to secure funding to pyrosequence 60 further amplicon samples; essentially we did 2003-2008. We deposited all these in the ICoMM dataset (link below) and it quickly became the largest study in the series. This was also a gold standard study for the Genomic Standards Consortium’s MIMARKS checklist (http://www.nature.com/nbt/journal/v29/n5/full/nbt.1823.html). We published the first analysis of these data in Nature Precedings in 2010 (http://precedings.nature.com/documents/4406/version/1).
We continued to characterize the microbial communities of the L4 sampling site in the Western English Channel by employing metagenomics and metatranscriptomics alongside more 16S rRNA V6 pyrosequencing across diel and seasonal time scales throughout 2008 (the final year of the 6 year time series). This study was published in PLoS ONE, also in 2010 (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0015545). This study also included our first analysis of archaeal diversity in the English Channel, which was also funded through the ICoMM initiative. We owe a lot to Mitch Sogin’s group for the first attempts at data analysis for the 16S rDNA profiles. We had a lot of difficulty getting the message right for the 6-year paper that was recently published in ISME J. Basically it was an issue of sequencing data as Natural History: we were generating data catalogs, and not doing enough to characterize the ecological interactions that occurred there.  So we reached out to the community, and found research groups who could help us plug that gap. Those involved Rob Knight’s team, Alice McHardy’s team, and Jed Fuhrman’s team. We worked a lot on improving this paper, and had some valuable help from a wide selection of other researchers, including Steven Giovannoni and Doug Bartlett, among many others.

The publication of this study, however, is just the start.

2. Who collected the samples? Any good field stories?

Samples were all collected by the fantastic boat staff at Plymouth Marine Laboratory, who routinely go out every Monday morning to collect water and specific samples for the whole laboratory. They were the life blood of that organization. One specific story I always like to relate is that during the 2008 sampling season, which generated samples for both the new ISME J paper (http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2011107a.html) and the 2010 PLoS ONE paper (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0015545), we wanted to get diel sampling effort during the winter, spring and summer. Unfortunately the only time I could convince my group to go out sampling for 24 hours was during the summer….sometimes science is limited by enthusiasm ;-). Also, the site is outside the Plymouth Sea Wall – which I think is still the largest concrete structure in the UK and was built in the 19th century – so taking people out to see the site (for what it was worth ;-)) meant taking them into usually very choppy water….which made people quite sick sometimes.

In May 2009, J. Craig Venter and his crew came through to start the European leg of his Global Ocean Sampling expedition at L4, specifically the Western English Channel. Together, our team at PML on our fishing boat, Plymouth Quest, and his team on board the 100ft yacht, Sorcerer II, sampled L4 and E1 (another monitoring site) in the Western English Channel. Excitingly these data form the first part of the attempt to start cataloguing the viral and Eukaryotic metagenomic and metatranscriptomic analysis of these communities. This analysis is also being further characterized using meta-metabolomics run by Carole Llewelyn at PML and Mark Viant at University of Birmingham, increasing the multi’omic nature of these data.

3. Can you give some web links for data, people involved , etc?

  • People on the paper – not an exhaustive list of those involved….this is a huge community effort.

4. What else do you want people to know ?

We have recently started to model the English Channel from both a taxonomic and functional perspective. I have attached a presentation that has cool gifs that demonstrate this; people can email me and request the gifs if necessary. These are generated by Peter Larsen at Argonne National Laboratory. This modelling is being driven by two new tools:

(1) Predicted Relative Metabolic Turnover, which uses functional annotations from metagenomes to create predicted metabolomes, which enable us to accurately predict the turnover (relative consumption or production) of more than 1000 metabolites in the English Channel (http://www.microbialinformaticsj.com/content/1/1/4).

(2) Microbial Assemblage Prediction, which enables the prediction of the relative abundance of every bacterial taxon at any given location and time. The predictions are driven by in situ or remotely modeled environmental parameter data. We used satellite data to produce the figures above, truly BUGS FROM SPAAAAACCCCCEEEE…..

This is the new paradigm – creating information and predictive models from data – no longer will metagenomics be descriptive Natural History – it is now becoming ECOLOGY. These tools will form the cornerstone of the Earth Microbiome Project’s (www.earthmicrobiome.org) data analytical initiative to create predictive models of microbial taxonomic community abundance structure and functional capability, defined as the ability of a community to turn over metabolites.

Note – as a bit of a side story – I am disappointed in the ISME Journal’s “Open” option for publishing. Though it uses a Creative Commons license, it is a pretty narrow one that says, for example, “You may not alter, transform, or build upon this work.” That is pretty limiting.  It means, for example, that the text cannot be reworked into a database of full text of papers where one uses intelligent language processing methods to play with the text.  It also means technically I probably cannot take the figures and modify them in any way to, for example, make an interesting movie using them.  Imagine if Genbank worked this way.  Imagine if you could only look at sequences but could not make alignments of them.  It is, well, not very open. So really this should be called the ISME “No charge” option or something like that since this is not “open access” to me – I think “open access” should really be reserved for material that is free of charge and free of most/all use restrictions (I prefer the broader version of the “open access” definition described by Peter Suber).  Sure – the fact that ISME makes some stuff available at no charge is nice.  And that they use CC licenses is good too since these are very straightforward to interpret compared to other licenses.  But their use of the no derivatives option seems silly. Anyway – nice paper.  And I hope some of the story behind the paper is useful to people.

Reference:

Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, & Field D (2011). Defining seasonal marine microbial community dynamics. The ISME journal PMID: 21850055

The story behind the story of my new #PLoSOne paper on "Stalking the fourth domain of life" #metagenomics #fb

Well, here goes.

This is a post about a paper that has been a long long time coming. Today, a paper of mine is being published in PLoS One. The paper is titled “Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees” and is available at http://dx.plos.org/10.1371/journal.pone.0018011. (or if that link does not work you can get a copy here). This paper represents something I started a long time ago and I am going to try to describe the story behind the paper here.

I note – we are not doing a press release for the paper, for a few reasons. But one of them is that, well, I am starting to hate press releases. So I guess this is kind of my press release. But this will be a bit longer than most press releases. I note – my key fear here is that somehow in my communications with the press or in our text in the paper or in this post I will overstate our findings. Here is the punchline – we found some very phylogenetically novel forms of phylogenetic marker genes in metagenomic data. We do not have a conclusive explanation for the origin of these sequences. They may be from novel viruses. They may be ancient paralogs of the marker genes. Or they may be from a new branch of cellular organisms in the tree of life, distinct from bacteria, archaea or eukaryotes. I think most likely they are from novel viruses. But we just don’t know.

UPDATE: Am posting some links here to news stories/blogs about our paper





    First – a summary of what we did.

    In the paper, we searched through metagenomic data (sequences from environmental samples) for phylogenetically novel sequences for three standard phylogenetic marker genes (ss-rRNA, recA, rpoB). We focused on sequences from the Venter Global Ocean Sampling data set because, well, we started this analysis many years ago when that was the best data set available (more on this below). What we were looking for were evolutionary lineages of these genes that were separate from the branches that corresponded to the three known “Domains” of life (bacteria, archaea and eukaryotes).

    To search for such novel lineages in the metagenomic data, we built evolutionary trees using these genes where we included sequences from known organisms (and viruses) as well as sequences from metagenomic data. We then looked through the trees for groups that were both phylogenetically novel and included only environmental data (i.e., they were new compared to known organisms or viruses). This method did not work very well for rRNA sequences (largely because making high quality alignments of short phylogenetically novel rRNA sequences was difficult – more on this below). But with RecA and RpoB homologs we were able to generate what we believe to be robust phylogenetic trees. And in these trees we found evidence for phylogenetically very novel sequences in environmental data.
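To make the tree-screening step above a little more concrete, here is a minimal sketch of one half of it: walking a tree and pulling out the largest groups whose members are all environmental sequences. This is not the code used for the paper. The nested-tuple tree, the `ENV_` naming prefix, and the function names are all assumptions made up for illustration, and the sketch says nothing about how novel (deeply branching) a group is, which would need branch lengths and support values.

```python
# Illustrative sketch only: the tree format, the "ENV_" prefix, and all
# names below are hypothetical, not taken from the paper's pipeline.

def leaves(node):
    """Return the leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def is_env(name, prefix="ENV_"):
    """Assume environmental (metagenomic) sequences carry a name prefix."""
    return name.startswith(prefix)


def env_only_clades(node, min_size=2):
    """Return maximal clades in which every leaf is environmental."""
    names = leaves(node)
    if all(is_env(n) for n in names):
        # Whole clade is environmental: report it and do not descend.
        return [names] if len(names) >= min_size else []
    if isinstance(node, str):
        return []
    found = []
    for child in node:
        found.extend(env_only_clades(child, min_size))
    return found


# Toy tree: internal nodes are tuples, leaves are sequence names.
tree = (
    (("Bacteria_Ecoli", "Bacteria_Bsub"), ("ENV_0001", "Bacteria_Syn")),
    (("Archaea_Mjan", "Eukaryote_Scer"), ("ENV_0002", ("ENV_0003", "ENV_0004"))),
)

novel = env_only_clades(tree)
print(novel)
```

In this toy tree, `ENV_0001` sits next to a known bacterium and is ignored, while the three remaining environmental sequences form a group with no known relatives inside it. A group like that is only a candidate; whether it is a deep, well-supported branch still has to be checked on the real tree.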

    Figure 1. Phylogenetic tree of the RecA superfamily. 

    Figure 3. Phylogenetic tree of the RpoB superfamily

    We then propose and discuss four potential mechanisms that could lead to the existence of such evolutionarily novel sequences. The two we consider most likely are the following

    1. The sequences could be from novel viruses
    2. The sequences could be from a fourth major branch on the tree of life

    Unfortunately, we do not actually know what the source of these sequences is. So we cannot determine which of the theories is correct. Obviously if there is a novel lineage of cellular organisms out there, well, that would be cool. But we have no evidence right now if that is what is going on. Personally, I think it is most likely that these novel sequences are from weird viruses. But as far as we can tell, they truly could be from a fourth major branch of cellular organisms and thus even though we did not have the story completely pinned down, we decided to finally write up the paper to get other people to think about this issue.

    Below I give all sorts of other details about the project in the following areas

    • The history of the project 
    • More detail on what is in the paper 
    • Follow up analysis and rapid posting with Google Knol 
    • Data deposition in Dryad 
    • Who was involved 
    • UPDATE: Funding for this work



    The history of the project

    Well, this is one of those projects for which the history is hard to explain. We started this work in 2004 when I was helping Venter and colleagues analyze the Sargasso Sea metagenome data. I was working at TIGR in 2003, which at the time was a sister institute to some of the institutes affiliated with the J. Craig Venter Institute (JCVI) (it was a complicated time). Craig had led a project to do a massive amount of shotgun sequencing of DNA isolated from the Sargasso Sea, which had been the site of many previous studies of uncultured microbes. And Craig, as well as some of the people working with him including John Heidelberg who was at TIGR, had asked me to help in analysis of the data. So I eventually went to a meeting about the project and got involved. It was quite exciting and I put a lot of effort into helping analyze the data.

    As part of my work on the project, I and Martin Wu and Dongying Wu did a variety of phylogenetic studies of genes and gene families. One of these was a phylogenetic analysis of proteorhodopsin homologs showing massively more diversity in the Sargasso data than in the PCR experiments done by Delong and Beja and others.

    Figure 7 from Venter et al. 2004. 

    We also did the first “phylotyping” in metagenomic data using genes other than rRNA. We built trees of bacterial ss-rRNAs, RecAs, RpoBs, HSP70s, EF-Tus and EF-Gs and then assigned each sequence to a phylum from the trees. In this analysis we found a variety of interesting things. 

    Figure 6 from Venter et al. 2004. 
    One thing I did not include in the Sargasso paper was an analysis I did of RecA homologs where I tried to include ALL RecA-like genes from bacteria, archaea, eukaryotes and viruses. The trees I made were a bit unusual but I was not sure that the alignments I had made were robust or that I had found all the RecA-like genes of interest so I did not even show this to Craig et al. at the time.
    UPDATE: I note – our work on this project was supported by a grant from the NSF Assembling the Tree of Life program that was awarded to me and Naomi Ward and Karen Nelson. Those funds supported the development of many of the informatics tools we used in this analysis and Martin and Dongying were both working on that project.

    After the Sargasso paper was published in 2004 though, I continued to fester about the RecA trees. And I wondered – if instead of trying to classify bacterial sequences into phyla, what if I tried to look for RecAs, rRNAs and other genes that were completely new branches in the tree of life? I got the chance to start to play with this concept again when Venter and crew asked me to help analyze the data coming out of the Global Ocean Sampling project. Again, this project was very exciting and interesting.


    As part of the project, I helped Shibu Yooseph and others look into whether the GOS data revealed any completely new types of functionally interesting genes, much like I had shown for proteorhodopsin in the Sargasso data.  


    Figure 7 from Yooseph et al. 2007. Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined 






    And again my mind started wandering towards the question of “OK – so – if there are all these very unusual and novel functionally interesting genes, what about looking for unusual and very novel phylogenetic marker genes”? So finally, I got back to work on the issue.

    And so I built a better RecA tree by first pulling out all possible homologs of RecA and RecA-like proteins from the GOS data and then building an alignment and a tree. And there they were. Some very f*%&$ novel RecAs – distinct from any previously known RecA-like proteins as far as I could tell. And so with help from Dongying and the JCVI crew, we started building a story about novel RecAs. And then we looked at RpoBs. And found novel ones too. And in mid 2006 while Shibu and Doug worked on their papers that were to be submitted to PLoS Biology and I worked on a review paper too, I told Emma Hill (who has since changed her name to Emma Ganley due to some sort of wedding thing) at PLoS Biology about an analysis that was consistent with the existence of a fourth domain of life. No overstating our findings really – just that we found very novel phylogenetic marker genes. And that I was working on a paper on it. But alas I never got it done, though I was happy to have convinced Venter to send the GOS papers to PLoS Biology and I think the papers that came out were good. Among the papers were my review (Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes), Doug Rusch’s diversity paper (The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific) and Shibu’s protein family paper (The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families), as well as many others as part of the Ocean Metagenomics Collection at PLoS.

    And in the midst of all of this, we had our first child and we wanted to move back to Northern California to be closer to family (my wife’s family is all in the Bay Area and my sister and brother Michael were in N. Cal too). So I applied for jobs and eventually took a job at UC Davis and we moved to Davis. Needless to say, all of that put a bit of a crimp in my work productivity. And once I was up and running at Davis, it just took a long time to get back to the searching for novel deep branches in the tree of life. But finally, we did it (with periodic prodding from Craig Venter). And we put together a paper and got it submitted to PLoS One in October. The reviews were very positive and enormously helpful. And we finally got a revision in January and it was officially accepted in February 2011. Only some seven years after my first work on the project. Whew.

    More detail of what is in the paper
    Well, I am going to be posting here some additional detail on what is in the paper.



    Why we punted on analysis of very novel rRNAs.

    The problem with rRNA is that the sequences that come from environmental samples are not complete (i.e. they only correspond to portions of the rRNA genes). Unfortunately, this makes a key step in phylogenetic analysis difficult – the alignment of sequences. We actually found about 200 rRNA sequences that seemed unusual in a phylogenetic sense. However, we were not convinced that the alignments of these fragments to other rRNAs were robust. This is because the alignment of rRNAs is best done making use of the base pairing secondary structure of the molecule and not the base sequence (i.e., primary structure).

    With only rRNA fragments, we could not use the secondary structure to do the alignments because you need the whole molecule to determine the best folding. Combined with the fact that we were searching for very distantly related ribosomal RNAs which would be hard to align even if we had the whole molecule, we were stuck for a bit. It seemed impossible to look for really novel organisms.
    So that is when we turned to other genes. The key for this is that there are protein coding genes that are universal and that for known organisms show similar patterns to rRNA in trees. In fact, in 1995 I wrote a paper showing that trees of RecA were very similar to trees of rRNA. RpoB is also considered a very robust phylogenetic marker. For organisms that we have in the lab (i.e., cultured) – many people use these other genes for phylogenetic analysis. rRNA has been very important in part because of the ease with which one can PCR amplify it from environmental samples and the fact that it is very hard to PCR amplify protein coding genes from the environment. Metagenomics changes this. With random sequencing, you get data from all genes. This means we can pick and choose genes to analyze for phylogenetic analysis and do not have to rely on rRNA.

    So we went after RecA first, because it has been shown to be a good phylogenetic marker for studies of the tree of life. And we found some very novel branches in the RecA tree. And after analyzing these and convincing ourselves that they were indeed phylogenetically very novel we went after RpoB. And also found very novel branches.

    So the phylogenetic analysis I think is very robust.

    RecA and RpoB as phylogenetic markers

    Many genes have been used as alternatives to rRNA genes to build “Trees of Life” including all organisms. Each has its own flavors of advantages and drawbacks. Two commonly used ones are the RecA and RpoB superfamilies.

    The many possible explanations for finding novel forms of phylogenetic marker genes

    The phylogenetically novel phylogenetic marker genes we found could have many explanations including that they could be ancient paralogs of these genes (but not found in any genomes we have available), they could be from viruses, or they could be from a novel branch on the tree of life. Or our trees could be bad. We think the latter is somewhat unlikely as our analysis has many lines of support. For example our RecA trees are very similar to those from a comprehensive study from M. Nei’s lab except they did not include the metagenomic data. But I guess it is still a possibility that our trees are biased in some way (e.g., by long branch attraction or bad alignments).

    Follow up analysis and rapid posting via Google Knol

    Amazingly and a bit sadly, I think we rushed the paper out. We left out one thing partly by accident – we had done an analysis of the locations from which these novel RecA and RpoB sequences had come. And somehow, in our final push to get the paper out, we left this out. I will be posting this information as soon as possible here and on the PLoS One site.

    In addition, after submitting the revision of our paper, we realized that we might be able to do a deeper analysis on one aspect of the work – how RpoB homologs from unusual DNA viruses compared to our novel sequences. We had included some RpoBs from DNA viruses in our analyses but not all that were available. So Dongying Wu did a very rapid additional analysis, adding some additional RpoB homologs to our alignment and making a tree of them. We then wrote a Google Knol about this new tree and submitted the Knol to PLoS Currents “Tree of Life” where it is currently in review. We are publishing the preprint of this Knol to make it available to all even while it is in review.


    Figure 2 from Wu and Eisen submitted. 

    Data availability

    There is a move afoot to make sure all data/tools associated with publications are readily available. We used publicly available sequence data and as much as possible publicly available tools for our work. We are trying to release as much as possible to allow people to re-analyze our work and to do any of the work themselves. We have therefore made use of the Dryad Data deposition service to post some of this material (see http://datadryad.org/handle/10255/dryad.8385).

    Who was involved

    • Dongying Wu, a brilliant “Project Scientist” in my lab, led the project (Project Scientist is one of the UC positions that is like what others call “Senior Scientist”). Dongying is simply one of the best bioinformaticians/computational biologists I have ever met. He was first author on many key papers from my lab including the Genomic Encyclopedia paper that came out last year and the glassy winged sharpshooter symbionts paper that came out a few years ago. Dongying worked in my group at TIGR and moved with me to UC Davis and currently splits his time between UC Davis and the DOE Joint Genome Institute. 
    • Martin Wu. Martin is an Assistant Professor at the University of Virginia. Prior to that he was a Project Scientist in my lab at Davis and a post-doc in my lab at TIGR. He is also a phenomenal bioinformatician / computational biologist. He developed the AMPHORA software in my lab and also led many genome projects (back when sequencing a genome was hard …) including that of the first Wolbachia genome and that of a very unusual bug Carboxydothermus hydrogenoformans. Martin helped with some of the genome analyses as part of this work. 
    • Aaron Halpern, Doug Rusch and Shibu Yooseph are all bioinformaticians from the J. Craig Venter Institute (Aaron is no longer there). All three helped with different aspects of dealing with and analyzing the GOS data and all three have been remarkably patient as this work dragged on and on. 
    • Marv Frazier from the JCVI was helpful in the initial set up and conceptualization of the project. 
    • J. Craig Venter is, well, Craig Venter, and he was involved in multiple aspects of the project including thinking about how and where to look for unusual sequences and interpreting some of the results.

    UPDATE: Funding for this work

    Most of my lab’s early work on this project was supported by a grant we had from the Assembling the Tree of Life program at the National Science Foundation (grant 0228651 to me and Naomi Ward). In that project we were working on sequencing and analyzing genomes from phyla of bacteria for which genomes were not available at the time. As part of this work we were designing methods to build phylogenetic trees from metagenomic data because we thought that our new genomes would be very useful in helping analyze metagenomic reads and figure out from which phyla they came. Later work on the project was supported by a grant to me, Jessica Green and Katie Pollard from the Gordon and Betty Moore Foundation (grant 1660).

    Some questions that might be asked and some answers (based in part on questions I have gotten from reporters). Note if you have other questions please post them here or on the PLOS One site for the paper.

    • Why no press release? Well, in part, because I sent information too late (shocking I know) to the Davis Press Office. But also because they have gotten suddenly busy with some Japan earthquake related actions. But also because, well, I really hate a lot of press releases. And finally, my brother had dinner with Carl Zimmer recently and apparently they discussed the possibility of having no press releases associated with papers. So here goes …. 
    • Really – what took so long? I would like to say the US Government made us hold back on publishing this until they could look into whether Venter collected ocean data from Roswell, NM or not. But really, the story above is true. We just did not get it done earlier. 
    • Why do you not know the source of the DNA (i.e., cells, viruses, etc)? This is why there was a six year wait between discovery and writing this up. We kept thinking we would be able to find the organisms but since I moved from TIGR and started a new job, we just never got around to getting to the source. We therefore decided to open this up to others who will hunt for the source by writing up the paper. 
    • Why did you not rename the Unknown 2 group in the RecA tree? We should have renamed our group “Thaumarchaeota” or something like that. When we did the initial analysis our group was novel. And then a few years ago a few groups obtained data from what is thought to be the third major lineage of Archaea – referred to by some as Thaumarchaeota. This is to go with the Euryarchaeota and Crenarchaeota. See http://www.ncbi.nlm.nih.gov/pubmed/20598889 for example. 
    • One of the clades in the RecA tree (XRCC2) seems out of place phylogenetically. I can see how that is confusing. The XRCC2 clade is very weird and hard to figure out. It is not the “normal” eukaryotic genes – those are the Rad51/DMC1 genes. One complication with the RecA family is that there have been duplication events to go with the species evolution. And thus eukaryotes have Rad51, DMC1, Rad51B, Rad51C, Rad57, XRCC3 and XRCC2. We tried to figure out where the XRCC2 group should go but it just was hard to place. The statistical support for its position (we used a method called bootstrapping) is low (note the lack of a number on the node where the branch leading to XRCC2 connects to the base of the tree). Most likely that group should be placed with some of the other eukaryotic groups. However, it seems likely that there was a duplication in the lineage leading up to the ancestor of eukaryotes and archaea (some studies have indicated they share a common ancestor to the exclusion of bacteria). Such a duplication would explain why basically all archaea have a RadA and RadB and all / most eukaryotes have multiple paralogs as well. 
    • The Unknown 1 group in the RecA tree seems to group with phage. What can you say about that? We think Unknown 1 is potentially of viral origin but still cannot tell. The fact that it clusters with RecA superfamily members from phage suggests this but it is distant enough from known phage for us to not be confident in any predicted origin. As for derivative forms vs. independent branch – this is one of the big questions about viruses these days. Many viruses encode homologs of “housekeeping” genes found across bacteria, archaea and eukaryotes. And in many cases the viral versions of these genes appear to be phylogenetically very novel. This is why the people studying mimivirus (which we refer to) suggest some viruses may in fact represent a fourth branch on the tree of life. It is possible that some viruses are in fact reduced forms of what were once cellular organisms – akin to parasitic intracellular species of bacteria possibly. 
    • Why are these phylogenetically novel sequences so low in abundance? This is a key question. I think it would be easy to come up with a theory for these being rare or these being common. They might be rare if their niche is very limited today. Or they might be rare because they could not be very competitive with other organisms. Or they could be rare because they require some unusual interactions with other taxa. In addition, we have only looked carefully at ocean water samples. If these are common somewhere else (e.g., hotsprings, deep subsurface, etc) we would not yet have figured that out. We are looking at additional metagenomic data right now to see if we can find any locations where relatives of these genes are more common.
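Since the XRCC2 answer above leans on bootstrapping, here is a rough sketch of the general idea for readers who have not met it: resample the columns of an alignment with replacement, redo the analysis on each resampled alignment, and report how often a grouping shows up. This is a toy, not what we ran for the paper. The four short sequences are made up, and instead of building a real tree for each replicate it uses a much cruder stand-in (are two sequences each other's nearest neighbors by simple mismatch count?). All function names are hypothetical.

```python
import random

# Toy illustration of the nonparametric bootstrap. Sequences, names, and the
# nearest-neighbor criterion are assumptions made for this sketch; a real
# analysis would rebuild a phylogenetic tree for every replicate.


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def nearest_neighbor(name, alignment):
    """Return the other sequence with the fewest mismatches to `name`."""
    others = [n for n in alignment if n != name]
    return min(others, key=lambda n: hamming(alignment[name], alignment[n]))


def bootstrap_support(alignment, a, b, replicates=200, seed=0):
    """Fraction of replicates in which a and b are mutual nearest neighbors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        if nearest_neighbor(a, rep) == b and nearest_neighbor(b, rep) == a:
            hits += 1
    return hits / replicates


# Made-up alignment: seqA and seqB are identical, seqC and seqD nearly so.
toy = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAC",
    "seqC": "TGCATGCATG",
    "seqD": "TGCATGCATT",
}

support_ab = bootstrap_support(toy, "seqA", "seqB")
support_ac = bootstrap_support(toy, "seqA", "seqC")
print(support_ab, support_ac)
```

A grouping that appears in nearly every replicate gets a high number on its node; one that comes and goes depending on which columns happen to be drawn gets a low number, or no number at all, which is the situation for the XRCC2 branch.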

      Dongying Wu, Martin Wu, Aaron Halpern, Douglas B. Rusch, Shibu Yooseph, Marvin Frazier, J. Craig Venter, & Jonathan A. Eisen (2011). Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees PLoS ONE, 6 (3) DOI: 10.1371/journal.pone.0018011

      Story behind the science: #PLoS Genetics "Evolutionary mirages" paper

      So there is this cool new paper out in PLoS Genetics: Evolutionary Mirages: Selection on Binding Site Composition Creates the Illusion of Conserved Grammars in Drosophila Enhancers, and I have wanted to write about it for a week or so. You see, the paper is about something I have been interested in for most of my career – how the particular processes by which mutations occur can sometimes be biased (i.e., some types of mutations are more common than others) and that these biases can create highly ordered patterns in genomes and in turn that observation of these ordered patterns can sometimes be misinterpreted as being the result of adaptation. Mistaken claims of adaptation in genomics are a favorite topic of mine – and led me to create (with tongue in cheek) a new omics word – Adaptationomics.

      Anyway – so I really really like this paper. But there is a wee bit of a problem in writing about it. You see, it is by my brother, Michael Eisen, a Prof. at UC Berkeley (and a student in his lab Richard Lusk). And, well, I don’t want to say anything wrong or stupid about the paper since, well, my brother will be pissed off. And so I have not written about it yet. But then I realized the best way to write about this one is to simply ask my brother for the “Story behind the science” for the paper, as I have been doing for some other recent papers.

      If you want a summary of the paper, here it is in their own words:

      Authors summary: Because mutation is a random process, most biologists assume that apparently non-random features of genome sequences must be the result of natural selection acting to create and preserve them. Where this is true, genome sequences provide a powerful means to infer aspects of molecular, cellular, and organismal biology from the signatures of selection they have left behind. However, recent analyses have shown that many aspects of genome structure and organization that have traditionally been attributed to selection can often arise from random processes. Several groups—including ours—studying the sequences that specify when and where genes should be produced have identified common, seemingly conserved, architectural features, based on which we have proposed new models for the activity of the complex molecular machines that regulate gene expression. However, in the work described here we simulate the evolution of these regulatory sequences and show that many of the features that we and others have identified can arise as a byproduct of random mutational processes and selection for other properties. This calls into question many conclusions of comparative genome analysis, and more generally highlights what Michael Lynch has called the “frailty of adaptive hypotheses” for the origins of complex genomic structures.

      Conclusions: Lynch has eloquently argued that biologists are often too quick to assume that organismal and genomic complexity must arise from selection for complex structures and too slow to adopt non-adaptive hypotheses. Our results lend additional support to this view, and extend it to show that indirect and non-adaptive forces can not only produce structure, but also create an illusion that this structure is being conserved. We do not doubt that many aspects of transcriptional regulation constrain the location of transcription factor binding sites within enhancers. Indeed a large body of experimental evidence supports this notion, and we remain committed to identifying and characterizing these constraints. But if this process is to be fueled by comparative sequence analysis, as we believe it must be, it is essential that we give careful consideration to the neutral and indirect forces that we now know can produce evolutionary mirages of structure and function.

      I must say I love the title lead in “Evolutionary mirages” which is another but much better way of saying “Adaptationism is a bad thing”.

      Anyway, before I get in any more trouble, here are some words about the paper from the Senior Author, Michael Eisen, my brother. Questions by me (I know, not very creative ones – but they will have to do):

      1. Why did you do this work?

      This paper started out as a control. My lab is interested in understanding how the enhancers that control gene expression work – focusing on those that control early development in Drosophila. In 2008, we published a paper showing that when we put enhancers from a distantly related family of flies into Drosophila melanogaster embryos, they drive patterns of expression that are identical to the endogenous D. melanogaster enhancers, even though they have almost no conservation of primary DNA sequence. But since they have the same function, they must have something in common – and so we compared the configurations of transcription factor binding sites in orthologous enhancers across different evolutionary timescales looking for something they shared.

      What we found is that binding sites in all of these enhancers occur in clusters. They are closer to each other than one would expect if they were scattered randomly in the ~1,000 bp of an enhancer. And, what’s more, sites that were close to each other were far more likely to be conserved. Surely, we thought, this could be no accident. So we proposed that enhancers are organized into compact clusters of sites for one or more factors – and that these “mini modules” are the primary unit of enhancer function.

      But as we worked to extend these analyses to whole genomes, we sought a more rigorous, quantitative assessment of just how improbable different levels of binding site clustering were. Like pretty much everyone in the field, we had used a null model in which binding sites were scattered randomly in an enhancer. But, I’ve been working with genomes long enough to know that nothing is ever truly random – and that all kinds of adaptive and non-adaptive processes create patterns in genome sequences that confound simple analyses. I wanted to come up with a null model for the distribution of sites within an enhancer that was more realistic.

      To do this I turned to my graduate student Rich Lusk, a card-carrying population geneticist trained at the University of Chicago. Rich was proud of his status as one of the few members of the lab who didn’t work on flies – but I convinced him to put aside the abstract models of binding site evolution in yeast and work on developing a real null model for our studies of enhancer evolution.

      The idea was to simulate enhancers evolving without any constraint on the organization of transcription factor binding sites they contain, and to see what happens. But this did not mean letting enhancers evolve neutrally – their extreme functional conservation demonstrates that they are under fairly strong constraint. Since it is pretty clear that these enhancers are responding to the same transcription factors in all of these species, Rich’s simulations required that enhancers maintain their binding site composition – but placed no constraints on how the sites were organized relative to each other.

      And what we found was striking. Even with no explicit selection on binding site organization – these evolved enhancers had lots of structure! Binding sites were clustered together, and, the closer together sites were, the more conserved they were — just like they were in real enhancers. It made us realize pretty quickly that the patterns we had latched onto – and which many other people were describing in different systems – might not be an evolutionary signature of constraint on the organization of sites within enhancers, but simply a byproduct of selection on binding site composition. If you want details, read the paper! But this has radically altered the way that we look at enhancer evolution.
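To make the idea of such a null model a bit more concrete, here is a toy sketch in Python. It is not the simulation from the paper. It assumes a single made-up 6-base motif, a 200 bp sequence, single-base substitutions only, and a crude form of selection in which any mutation that drops the number of motif matches below a minimum is rejected. Nothing in it constrains where the sites sit relative to each other. All names and parameter values are illustrative assumptions.

```python
import random

BASES = "ACGT"

def site_positions(seq, motif):
    """Start positions of exact matches to motif in seq."""
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]

def evolve(seq, motif, min_sites, steps, rng):
    """Apply random substitutions, rejecting any that leave fewer than
    min_sites motif matches (selection on composition, not arrangement)."""
    seq = list(seq)
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice([b for b in BASES if b != old])
        if len(site_positions("".join(seq), motif)) < min_sites:
            seq[pos] = old  # mutation removed a required site: reject it
    return "".join(seq)

rng = random.Random(0)
motif = "TAATCC"  # arbitrary motif chosen for illustration
length = 200
ancestor = [rng.choice(BASES) for _ in range(length)]
for start in (20, 60, 100, 140, 180):  # plant five evenly spaced sites
    ancestor[start:start + len(motif)] = motif
ancestor = "".join(ancestor)

descendant = evolve(ancestor, motif, min_sites=5, steps=2000, rng=rng)
print("ancestor sites:  ", site_positions(ancestor, motif))
print("descendant sites:", site_positions(descendant, motif))
```

Comparing site spacing and site retention across many such runs against the same statistics in real enhancers is the kind of comparison a null model like this supports. This toy version is far too simple to reproduce the results described above.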

      2. How did you come up with the title?

      Rich and I were writing the paper, and we had some really long, hideous, boring title. In writing the paper, the idea that things are not always what they appear to be was at the forefront of my mind. I was thinking about how desperate we and other people in the field were to figure out how enhancers work – it’s a vexing problem that has defied decades of work – and how we all hoped that evolutionary analysis was going to rescue us – and how quickly and eagerly we latched on to the first signs of a signal – and how that was just like a mirage you see in the desert….

      3. Any interesting background?

      (see 1)

      4. When did the work start?

      About a year ago. We had been thinking about this for a while, but only when Rich focused on it did things get rolling.

      5. Why PLoS Genetics? Did PLoS Biology reject it?

      PLoS Genetics was our first choice. PG has become the premier journal for evolutionary genetics – it routinely publishes the most interesting and important work in the field, and everyone reads it. While every paper I’ve sent there has been heavily scrutinized, the editorial process has been fair (though sometimes agonizingly slow….), and each review has been thoughtful and many (including in this case) helped to vastly improve the paper.

      Lusk, R., & Eisen, M. (2010). Evolutionary Mirages: Selection on Binding Site Composition Creates the Illusion of Conserved Grammars in Drosophila Enhancers PLoS Genetics, 6 (1) DOI: 10.1371/journal.pgen.1000829

      Story behind the science: #PLoS Biology paper on cichlid vision evolution

      I am continuing on a new theme here in trying to get author feedback on recent PLoS publications.  Today I write about a recent paper in PLoS Biology, “The Eyes Have It: Regulatory and Structural Changes Both Underlie Cichlid Visual Pigment Diversity” by Christopher Hofmann, Kelly O’Quin, N. Justin Marshall, Thomas Cronin, Ole Seehausen and Karen L. Carleton.

      This paper discusses “how changes in gene regulation and coding sequence contribute to sensory diversification in two replicate radiations of cichlid fishes.” A good overview of the paper is in an accompanying article “Visual Tuning May Boost African Cichlid Diversity” by Robin Meadows:

      “African cichlid fish form new species faster than any other vertebrates, with hundreds of species evolving within the last 2 million years in Lake Malawi and within the last 120,000 years in Lake Victoria. This rapid speciation makes cichlids good models for elucidating the genetic mechanisms behind biodiversity. Vision may play a key role in cichlid evolution, adapting them to forage for new foods or colonize new habitats. Vertebrate retinas have two groups of light-sensitive proteins called opsins: those in rod photoreceptors, which are sensitive to dim light, and those in cone photoreceptors, which are sensitive to color. Changes in the visual system could be due to differences either in the expression of opsin genes or in their DNA sequences. A Research Article in this issue of PLoS Biology by Christopher Hofmann and colleagues suggests that both mechanisms underlie changes in visual sensitivity in cichlids.”

      For more on the science, see her summary and see the article itself. Additional information can be found in the press release from U. MD

      But what I wanted to cover here was some of the story behind the science.  So I emailed the authors some questions which they were kind enough to answer and I post the details here. There are some really interesting tidbits in these answers in my opinion, including how they dealt with merging two papers into one, and how difficult (but fun) it is to do this field work in Lake Malawi.

      1. What led you to do the study reported in the paper?

      From Karen Carleton:

      This study was a long time in the making.  We started studying the visual system of cichlids in the 1990’s.  We learned quickly that there was a lot of variation in opsin expression within the Lake Malawi species.  However, we had only examined a few species.  In 2005, Tom Cronin and Justin Marshall (world experts on aquatic visual systems) agreed to come to Lake Malawi with us and help examine a greater number of species.  Justin brought his underwater spectrometer and characterized the light environment.  Tom and I measured fish colors (that paper is under review) and I extracted retina for quantifying gene expression.

      Because Lake Malawi and Lake Victoria both contain large cichlid radiations and had such different light environments, Ole Seehausen and I started working together in 2000 to compare visual systems in Malawi and Victoria.  (Ole is the world expert on Lake Victoria cichlids, having helped discover the large rock dwelling species flock that escaped the devastation of the Nile perch). We concentrated on opsin sequences in our previous publications.  However, we wanted to look at gene expression as well.

      I was fortunate in 2006 to move to U Maryland where Chris Hofmann and Kelly O’Quin joined in our efforts.  Chris took on the Victoria cichlid gene expression based on samples that Ole had collected.  Kelly became our statistical wizard and analyzed the Malawi data we had gathered.  (He has also been working on the visual system of Tanganyikan cichlids, which are the ancestors of the Malawi and Victoria flock.  This work has recently been submitted).

      From Kelly:

      I see Karen gave you a nice review of how this paper was started.  As she said, the work was started before I joined her lab.  At that time, we were primarily concerned with moving into the new lab at UMCP, so no one was actively working on the data set.  I initially analyzed the data to practice for a similar study of Tanganyikan cichlids.  But, as I learned more about the power (and pitfalls) of the comparative analysis, I became more and more involved with the actual analysis and discussions, and after about 6 months Karen asked me to write up the paper for the Lake Malawi data set.  At the same time Chris was working on a manuscript for the Victorian data.  After seeing the overlap in the two papers — really the similarities and differences — Karen and Chris and I decided it would be useful to put the two together.

      2. How did this group come together, with people from Australia, Switzerland and Maryland?

      From Karen:

      Vision science is a small international community that is wonderfully supportive.  The cichlid community is also small and makes for excellent collaborations.  This is what makes research great – combining expertise from such a diverse group of people.  This enables us to think across many disciplines from physics to biology and integrate light measurements, ecology, molecular biology and genetics to try and understand what drives cichlid visual communication and determine how it plays a role in speciation.

      From Christopher:

      I would add that both Europe and Australia have some top people in the field of visual ecology.  Also, I don’t think we could have had a paper with such a broad scope without our collaborators.  Once we all got together things just kept building and it was very exciting.

      3. A question for Kelly — how do you feel about the “joint contribution” statement.  Do you think there needs to be a system to truly list two first authors or do you think this statement will suffice? 

      From Karen:

      I feel like I should chime in here.  We originally had written two separate papers with Chris as lead author on the Victoria data and Kelly on the Malawi data.  However, we all felt a combined paper could be more powerful.  I asked Kelly and Chris to combine these papers, though that was a very difficult thing to ask, particularly in these times of first author is best.  However, this paper is truly the joint effort of these two as well as the rest of the authors and would not be the paper that it is without everyone’s contributions and perspectives.

      From Kelly:

      It is nice to be recognized for the work and effort given, and presumably this is accomplished in the ‘Author Contributions’ statement as well as the order in which authors are listed.  For this paper, Chris and I each authored manuscripts that Chris had to painstakingly combine.  After a lot of debate over the meaning and limits of our comparative results, we each wrote a new draft of the combined study that Karen then resolved into a single manuscript.  Tom, Justin, and Ole provided lots of comments and additional text throughout this process as well.  This truly was a collaborative effort, with plenty of contribution and compromise on everyone’s part.  Although the order in which the authors are listed cannot possibly communicate all of the nuances involved (though I am certainly happy with the order given), I hope we were able to address them with the ‘joint contribution’ statement you mention, as well as our ‘Author contributions’ statement (which lists just about every author under each category).

      In short, I don’t think a simple change to the way that we list authors will ever capture all of the individual and combined efforts that go into a study.  Instead, I think we need to change the way we read and interpret that list.

      4. How did you end up choosing PLoS Biology as a place to submit the paper to? Were there any debates among the group about publishing there?

      From Karen:

      Online journals, such as PLoS Biology, give us a lot of flexibility to include all the supporting data without limiting the length of the paper.

      From Kelly:

      Since we had essentially two large studies here, the generous space and supplemental information limits allowed by PLoS made it a natural choice to publish in.

      5. Do you have any good stories about the field work?  

      From Karen:

      Field work in Malawi is never dull.  Getting there is the first problem.  It is a 24 hr plane ride if all goes well (which it never does) plus a 5 hr drive down to the lake, partly on Malawi dirt roads.  Once you get there, however, the lake is a beautiful place.  The diving is about the best in the world and it is wonderful to immerse yourself in your organism’s habitat.  Underwater, it is wall to wall fish, with 50 or more species in a single location so it is perfect for observing and collecting a wide diversity of species.

      The field station is run by the University of Malawi. It is right next to Chembe village and the people there are incredibly warm and friendly.  The research station has electricity and cold running water.  This is very high living for the village and makes for an interesting dichotomy.  Several of the villagers are experts on cichlid fish, including Richard Zatha, and they dive with us.  They can catch fish far faster than we can. There is considerable wildlife including the baboons which like to come into the house and steal bread off the table.  We were fortunate in not having to deal with hippos or crocodiles on either of our recent trips.

      It is quite expensive to take a group to Malawi.  However, it is essential for everyone to see their organism in its natural habitat.  It also takes a lot of preparation as well to get a group of scuba divers certified and ready to do this kind of field work.

      I’m sure Ole has comparable stories for his work in Lake Victoria.

      From Christopher:

      To build on what Karen said.  Going to Lake Malawi and actually diving with the fish is an incredible experience.  When we work in our aquaculture facility we have maybe a handful of fish from a few different species in a single tank.  In the field, once you drop below the surface it is an entirely different world.  There are literally hundreds if not thousands of fish from many different species all doing their own thing.  Some are eating algae, others plankton and even other fish.  Many of these species are ones that are impossible to keep or breed in captivity, which makes the challenges of getting there worthwhile.

      From Kelly:

      Not really other than to say that it is a lot of hard work.  But if you like SCUBA diving in remarkably clear water with beautiful, colorful fish, I can’t think of a better place to work than Lake Malawi.

      6. Can you provide links to web sites of the authors and or other links of interest such as videos of the fish, twitter pages, etc?
      7. Anything else you want to add:
      From Christopher:

      I would also add that it’s not easy to catch fish in Malawi.  There is a definite art to scuba diving and handling a net.  Having local cichlid experts was invaluable.

      ———————–
      Cichlid picture by Christopher Hofmann doi:10.1371/journal.pbio.1000267.g001

      Meadows, R. (2009). Visual Tuning May Boost African Cichlid Diversity PLoS Biology, 7 (12) DOI: 10.1371/journal.pbio.1000267

      Hofmann, C., O’Quin, K., Marshall, N., Cronin, T., Seehausen, O., & Carleton, K. (2009). The Eyes Have It: Regulatory and Structural Changes Both Underlie Cichlid Visual Pigment Diversity PLoS Biology, 7 (12) DOI: 10.1371/journal.pbio.1000266

      More coverage of the GEBA "Phylogeny Driven Genomic Encyclopedia"

      Just a quick note here to post some links to additional stories about my new paper on “A phylogeny driven genomic encyclopedia of bacteria and archaea” which was published last week in Nature (with a Creative Commons license – which is rare in Nature but is what they use for genome sequencing papers).

      Carl Zimmer has an article today in the New York Times “Scientists Start a Genomic Catalog of Earth’s Abundant Microbes”  about the paper and the project.  In the article he interviews me and Hans-Peter Klenk, who was a co-author and led the culturing part of the project.  A few things to note about this:

      • It is rare to have archaea mentioned in the New York Times.
      • There is a tree that goes along with the article which is a modified version of the tree we had in our paper.  I think theirs is very nice. Kudos to their artist.
      • There is a quote by Norm Pace generally supportive of the project.
      • The article mentions the JGI Adopt a Microbe program and even has a shout out to Malcolm Campbell at Davidson College and his recent PLoS One paper where they discuss results from a project where they took one of the genomes from our project and used it as part of a course on genome annotation/analysis. 

      For some of the story behind the paper see my recent blog post “Story Behind the Nature Paper on ‘A phylogeny driven genomic encyclopedia of bacteria & archaea’ #genomics #evolution”.

      Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea Nature, 462 (7276), 1056-1060 DOI: 10.1038/nature08656

      Bakke, P., Carney, N., DeLoache, W., Gearing, M., Ingvorsen, K., Lotz, M., McNair, J., Penumetcha, P., Simpson, S., Voss, L., Win, M., Heyer, L., & Campbell, A. (2009). Evaluation of Three Automated Genome Annotations for Halorhabdus utahensis PLoS ONE, 4 (7) DOI: 10.1371/journal.pone.0006291

      Story Behind the Nature Paper on ‘A phylogeny driven genomic encyclopedia of bacteria & archaea’ #genomics #evolution

      Today is a fun day for me. A paper on which I am the senior author is being published in Nature (yes, the Academic Editor in Chief of PLoS Biology is publishing a paper in Nature, more on that below ..). This paper, entitled, “A phylogeny driven genomic encyclopedia of bacteria and archaea” represents a culmination of years of work by many people from multiple institutions. Today in this blog I am going to do my best to tell the story behind the paper – about the people and the process and a little bit about the science.

      First, a brief bit about the science in the paper. In this paper, we (mostly people at the Joint Genome Institute, where I have an Adjunct Appointment — but also people in my lab at UC Davis and at the DSMZ culture collection) did a relatively simple thing – we started with the rRNA tree of life as a guide. Then we identified branches in the bacterial and archaeal portions of this tree where there were no genome sequences available (or in progress) (this was done mostly by Phil Hugenholtz, Dongying Wu and Nikos Kyrpides). Next we searched for representatives of these “unsequenced” branches in the DSMZ culture collection (a collection of bacteria and archaea that can be grown in the lab). And we identified in total some 200 of these. And then the DSMZ (under the direction of Hans-Peter Klenk) grew these organisms and sent the DNA to the Joint Genome Institute. And then JGI turned on their genome sequencing muscle and sequenced the genomes of the organisms in the DNA samples. And finally, we spent a good deal of time then analyzing the data asking a pretty simple question – are there any general benefits that come from this “phylogeny driven” approach to sequencing genomes compared to what one might find with sequencing just any random genome (after all, any genome sequence could have some value)? The paper describes what we found, which is that there are in fact many benefits that come from sequencing genomes from branches in the tree for which genomes are not available.

      More on the details of the science below. But first, I want to note that this paper was truly an amazing team effort, with all sorts of people from the JGI in particular, going above and beyond the call of duty to make sure it happened and worked well. And the Department of Energy has been truly phenomenal in my opinion in supporting this project which in the end is not explicitly about “energy” per se but is really about providing a reference set of genomes that should improve the value of all microbial genome data.

      Anyway, now for the story behind the story. And be prepared, because this is a bit long. But I think it is important to place this work in a bigger context both in terms of my background as well as some of the background of other people in the project. If you can’t wait for more on the GEBA project then perhaps you should go to some of these links:

      And I will post more links as they come up. Below what I try to provide is some of the story behind the story:

      My personal interest in applied uses of phylogenetics stage 1: undergraduate preparation at Harvard
      As this paper is primarily about an applied use of phylogenetics (in selecting genomes for sequencing), I thought it would be worth going into some of how I personally became a bit obsessed with applied uses of phylogenetics. For me, my obsession began as an undergraduate at Harvard where I got exposed to the value of phylogeny as a tool from many many angles including but not limited to:

      • Freshman year taking a course taught by Stephen Jay Gould where Wayne and David Maddison were Teaching Assistants and where they were demoing their new phylogenetics software called MacClade
      • Sophomore year taking a conservation biology class with Eric Fajer and Scott Melvin where I was exposed to the concept of “phylogenetic diversity” as a tool in assessing conservation plans
      • Junior year working in the lab of Fakhri Bazzaz with people like David Ackerly and Peter Wayne who made use of phylogeny as a key tool in their research projects
      • Senior year and the year after graduating where I worked in the lab of Colleen Cavanaugh using rRNA based phylogenetic analysis to characterize uncultured chemosynthetic symbionts. I note it was in Colleen’s lab that I also became obsessed you could say with microbes and why they rock.
      My personal interest in applied uses of phylogenetics stage 2: graduate school at Stanford
      All of this and more gave me a strong passion for phylogeny as a tool. And so I went to graduate school at Stanford (originally to work with Ward Watt on butterflies, but then I switched to working in Phil Hanawalt’s lab on the “Evolution of DNA repair genes, proteins and processes”). And while in that lab I became pretty much obsessed with three things, all related to phylogeny.
      First, I was interested in whether the rRNA tree of life, which I had used in my studies in Colleen Cavanaugh’s lab (and in my first paper in J. Bacteriology, which, thanks to ASM, is now in Pubmed Central and free at ASM’s site too), was robust or, as some critics argued, was not that useful. This was a critical question since the best way to study the phylogeny of microbes at the time, and also the best way to study uncultured microbes, was to leverage the ability to clone rRNA genes by PCR and then to build evolutionary trees of those rRNA genes. As part of my graduate work, I did a study where I compared the phylogenetic trees of rRNA to trees of another gene from the same species (I chose recA). Surprisingly, despite the claims that the rRNA tree was not very useful and that different genes always gave different trees, if you compared the two trees ONLY where there was strong support for a particular branching pattern, the trees of the two genes were in fact VERY VERY similar (a finding that had been suggested previously by others, including Lloyd and Sharp).
      Second, I also became obsessed with the fact that most of the experimental studies of DNA repair processes were in a very narrow sampling of the phylogenetic diversity of organisms (e.g., at the time, no studies had been done in archaea, and most studies in bacteria were from only two of the many major groups). So I started experimental studies of repair in halophilic archaea in order to help broaden the diversity of studies. And I began to use PCR to try and clone out repair genes from various other species of diverse bacteria and archaea. Alas, as I was doing this, some institute called TIGR was sequencing the complete genomes of organisms I was trying to clone out single genes from. In fact, one of the first organisms I was working on for PCR studies was Archaeoglobus fulgidus. And when I found out TIGR was sequencing the genome, in a project led by none other than the great microbial evolutionary biologist Hans-Peter Klenk (yes, the same one who helped us in this GEBA project), I decided it was silly to try to clone out individual genes by PCR. And thus I began to learn how to analyze genomes.
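The "compare only where support is strong" idea can be sketched in a few lines of Python. This is just an illustration, not the analysis from my thesis. It assumes each tree has already been reduced to a dictionary mapping a bipartition (the set of taxa on one side of a branch) to its bootstrap percentage, and it uses 70 as an arbitrary cutoff for "strong". The taxon names and support values are made up.

```python
def compatible(a, b, taxa):
    """Two bipartitions can coexist in one tree if at least one of the
    four intersections of their sides is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return (not (a & b) or not (a & b_rest)
            or not (a_rest & b) or not (a_rest & b_rest))

def strong_conflicts(tree1, tree2, taxa, threshold=70):
    """Pairs of well-supported bipartitions (one per tree) that contradict."""
    strong1 = [s for s, support in tree1.items() if support >= threshold]
    strong2 = [s for s, support in tree2.items() if support >= threshold]
    return [(s1, s2) for s1 in strong1 for s2 in strong2
            if not compatible(s1, s2, taxa)]

taxa = frozenset("ABCDE")
# Hypothetical trees: both strongly group A with B, and disagree only
# on a weakly supported deeper branch.
rrna_tree = {frozenset("AB"): 95, frozenset("ABC"): 40}
reca_tree = {frozenset("AB"): 90, frozenset("ABD"): 45}

print(strong_conflicts(rrna_tree, reca_tree, taxa))        # strong branches only
print(len(strong_conflicts(rrna_tree, reca_tree, taxa, threshold=0)))  # all branches
```

In this toy case the two trees look different if every branch is taken at face value, but they have no disagreement once the poorly supported branches are set aside.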
      It was in the course of learning how to analyze genomes that I came up with another applied use of phylogeny. I realized that one should be able to use phylogenetic studies of genes to help in predicting functions for uncharacterized genes as part of genome annotation efforts. And so I wrote a series of papers showing that this in fact worked (I did this first for the SNF2 family of proteins and then alas coined my own omics word “phylogenomics” to describe this integration of genome analysis and phylogenetics and formalized this phylogenomic approach to functional prediction). I note that what I was arguing for was that protein function could be treated like ANY other character trait and one could use character trait reconstruction methods (which I had learned about while playing with that MacClade program) to infer protein functions for unknown proteins in a protein tree. I note that this notion of predicting protein function from a protein tree is completely analogous to (and one could rightfully say stolen from) how researchers studying uncultured microbes were trying to infer properties of microbes from the position of their rRNA genes in the rRNA tree of life (as I had learned in studies of symbioses).
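Here is a minimal sketch of what treating protein function as a character trait can look like. It uses Fitch parsimony, which is one standard character reconstruction method and not necessarily the one used in those papers. The tree is a strictly binary nested tuple, and the gene names and function labels are hypothetical.

```python
def fitch_up(node, known, states):
    """Bottom-up pass on a binary tree: returns (leaf_name, state_set, children).
    Leaves with no known function may take any state."""
    if isinstance(node, str):
        leaf_states = {known[node]} if node in known else set(states)
        return (node, leaf_states, [])
    left, right = (fitch_up(child, known, states) for child in node)
    shared = left[1] & right[1]
    return (None, shared if shared else left[1] | right[1], [left, right])

def fitch_down(annotated, parent_state, out):
    """Top-down pass: keep the parent's state when allowed, else pick one
    from the node's own set (alphabetically first, an arbitrary tie-break)."""
    name, state_set, children = annotated
    state = parent_state if parent_state in state_set else sorted(state_set)[0]
    if name is not None:
        out[name] = state
    for child in children:
        fitch_down(child, state, out)

def predict_functions(tree, known):
    """Infer a function label for every leaf, including unannotated ones."""
    out = {}
    fitch_down(fitch_up(tree, known, set(known.values())), None, out)
    return out

# Hypothetical protein family tree; "unknown1" has no characterized function.
tree = ((("geneA", "geneB"), "unknown1"), ("geneC", "geneD"))
known = {"geneA": "DNA repair", "geneB": "DNA repair",
         "geneC": "transcription", "geneD": "transcription"}
print(predict_functions(tree, known)["unknown1"])
```

The uncharacterized gene picks up the function of the subfamily it branches within, not the function of whichever sequence happens to be the top database hit. That is the heart of the phylogenomic approach.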
      My personal interest in applied uses of phylogenetics stage 3: my plans for a post doc
      So as I was wrapping up graduate school I was seeking a way to go beyond what I was doing and combine studies of DNA repair and evolution and microbiology in another way. And I thought I had found a perfect one in a post doc I accepted with A. John Clark at U. C. Berkeley. John was the person who had discovered recA, the gene I had been using for phylogenetic analysis and for structure function studies. And he was working with none other than Norm Pace and a young hotshot in Norm’s lab, Phil Hugenholtz (as well as a few others including Steve Sandler) in trying to use the recA homolog in archaea as a marker for environmental studies of archaea. It sounded literally perfect. And so I was excited to start this job. That was, until I met Craig Venter.
      Grabbing the TIGR by the tail
      While I had been playing around with data from TIGR in the latter years of my time in graduate school, I also got involved in teaching a fascinating class with David Botstein, Rick Myers, David Cox and others. (As an aside, this class was part of a new initiative I helped design at Stanford on “Science, Math and Engineering” for non science majors – an initiative that was a pet project of none other than Condie Rice who was Provost at the time). Anyway, Rick Myers was serving as a host for one Craig Venter when he came and gave a talk at Stanford and somehow I managed to finagle my way into being invited to go out to dinner with Craig. And at dinner, I proceeded to tell Craig that I thought some of the evolution stuff he was talking about was bogus and I tried to explain some of my work on phylogeny and phylogenomics. Not sure what Craig thought of the cocky PhD student drawing evolutionary trees on napkins, but it eventually got me a faculty job at TIGR and I worked extensively with Craig so it must have been worth something. And so I and my fiancée Maria-Inés Benito (now wife …) moved to Maryland and spent eight great years there (my wife started in MD as a faculty member at TIGR too, but then she left to go to a company called Informax, may it rest in peace).

      Most of my work at TIGR focused on a different side of phylogenomics than represented in the GEBA project. At TIGR I focused on the uses of evolutionary analysis as a component to analyzing genomes – from predicting gene function to finding duplications (e.g., see the V. cholerae genome paper) to identifying genes under unusual patterns of mutation or selection to finding organelle derived genes in nuclear genomes (e.g., see this) to studying the occurrence of lateral gene transfer or the lack of occurrence of it to studying genome rearrangement processes. And sure, every once in a while I worked on a project where the organism was the first in its major branch to have a genome sequenced (e.g., Chlorobi). And I had noted, along with others, that there was a big phylogenetic bias in genome sequencing projects (e.g., see my 2000 review paper discussing this here).

      But that did not really drive my thinking about what genome to actually sequence until TIGR hired a brilliant microbial systematics expert Naomi Ward as a new faculty member. And it was Naomi who kept emphasizing that someone should go about targeting the “undersequenced” groups in the Tree of Life.

      NSF Assembling the Tree of Life grant
      And so Naomi and I (w/ Karen Nelson and Frank Robb) put together a grant for the NSF’s “Assembling the Tree of Life” program to do just this – to sequence the first genomes from eight phyla of bacteria for which no genomes were available but for which there were cultured organisms. Amazingly we got the grant. And we did some pretty cool things on that project, including sequencing some interesting genomes, and developing some useful new tools for analyzing genomes (e.g., STAP, AMPHORA, APIS). And I was able to hire some amazing scientists to work in my lab on the project including Dongying Wu (the lead author on the GEBA paper) and Martin Wu (who also worked on the GEBA project and is now a Prof. at U. Virginia) and Jonathan Badger. Alas, we did not publish any earth shattering papers as part of this NSF Tree of Life project on analyzing the genomes of these eight organisms, not because there was not some interesting stuff there but for some other reasons. First, I moved to UC Davis and there was a complicated administrative nightmare in transferring money and getting things up and running at Davis on this project so my lab was not really able to work on it for two years (in retrospect, what a f*ING nightmare dealing with the UC Davis grants system was …).

      Then, just as things were ready to get restarted, TIGR kind of imploded and many of the people, including Naomi, my CoPI, left (though I note, my moving to Davis was unrelated to the dissolution of TIGR). But perhaps most importantly, there were some actual technical and scientific problems with our dreams of changing the world of microbiology from our phyla sampling project – the science was not quite there. In particular, having a single genome from each of these phyla was simply not enough to get (and show) the benefits that can come from improved sampling of the tree of life. And thus though we have published some cool papers from this project (e.g., see this PLoS One paper on one of the genomes), we all ended up, in one way or another, disappointed with the final results.

      Davis and JGI: the return of phylogeny to genomic sampling
      When I moved to UC Davis I also was offered (and accepted) an Adjunct Appointment at the Joint Genome Institute (JGI). At both places, I envisioned reinventing myself as someone who worked on studying microbes directly in the environment (e.g., with metagenomics) and symbioses (both of which I had started on at TIGR). And in fact, in a way, I have done this, since I got some medium to big grants to work on these issues. I tried diligently to attend weekly meetings at the JGI but it was difficult since technically I was 100% time at UC Davis and was in essence supposed to be at 0% time at JGI. And when JGI hired Phil Hugenholtz to run their environmental genomics/metagenomics work, I was needed less at JGI since, well, Phil was so good. It was great to go over there and interact with Eddy Rubin, Phil Hugenholtz, and Nikos Kyrpides, among others, but it was unclear what exactly I would do there with Phil running the metagenomics show.

      And then, like magic, something came up. I went to one of the monthly senior staff meetings at JGI and while we were listening to someone on the speaker phone, Eddy Rubin handed me a note asking me what I thought about the proposal someone was making to sequence all the species in the Bergey’s Manual. And the light bulb of phylogeny went back on in my head. I told him (I think I wrote it down, but may have said out loud), something like “well, sequencing all 6000 or so species would be great, but it would be better to focus on the most phylogenetically novel ones first.” And in a way, GEBA was born. Eddy organized some meetings at JGI to discuss the Bergey’s proposal and I argued for a more phylogeny driven approach. And this is where having Phil Hugenholtz and Nikos Kyrpides at JGI was like a perfect storm. You see, both had been lamenting the limited phylogenetic coverage of genomes for years, just like I had. Phil had even written a paper about it in 2002 which we used as part of our NSF Tree of Life proposal. And Nikos too had been diligently working for years to make sure novel organisms were sequenced. So though we went to a meeting to discuss the Bergey’s manual idea, we instead proposed an alternative – GEBA.

      And for some months, we pitched this notion to various people including at JGI, DOE, and various advisory boards. And the response was basically – “OK – sounds like it COULD be worth doing – why don’t you do a pilot and TEST if it is worth doing.” And so, with support from Eddy Rubin and DOE, that is what we did.

      One key limitation – getting DNA

      So Phil, Nikos and I and a variety of others started working on the general plan behind GEBA. But there was one key limitation. How were we going to get DNA from all these organisms? One possibility was to seek out diverse people in the community and have them somehow help us. This had some serious problems associated with it, not the least of which was the worry that we might end up sequencing varieties of organisms that people had in their lab but which nobody else had access to (something Naomi Ward and I had written about as a problem a few years before).

      And here came the second perfect storm – none other than Hans-Peter Klenk (yes, the same one who had led some of the early genome sequencing efforts when he was at TIGR), was visiting JGI. And he had a relatively new job – at the German Culture Collection DSMZ (In fact, I should note, I had tried to get a job at TIGR even before I met Venter, since they had a position advertised for a “microbial evolutionary biologist” — but that job went to Klenk). Phil Hugenholtz had asked the Head of DSMZ, Erko Stackebrandt, if they might be interested in helping us grow strains and get DNA but we did not yet have a full collaboration with them. And Erko had suggested we contact Hans-Peter. And in his visit to JGI it became apparent that he would do whatever he could to help us build a collaboration with DSMZ. And thus we had a source of DNA. Even more amazingly to me, they did it all for free.

      GEBA begins

      And thus began the real work in the project. Phil used his expertise with rRNA databases, especially GreenGenes, to pull out phylogenetic trees of different groups. And Nikos used his expertise as the curator of a database on microbial sequencing projects (called GenomesOnline) to help tag which branches in Phil’s tree had sequenced genomes or ones in progress. And then they looked for whether any of the members of the unsequenced branches had representatives in the DSMZ collection. And with some help from Dongying Wu and me, we came up with a list. And with the help of the JGI “Project Management” team including David Bruce and Lynne Goodwin and Eileen Dalin and others at JGI we developed a protocol for collaborating with DSMZ and getting DNA from them.
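The selection logic described above can be sketched roughly as follows. This is a toy illustration, not the actual GEBA pipeline: the tree is a hypothetical nested-tuple structure standing in for an rRNA tree, the strain names are made up, and taking the first available strain in each clade is an arbitrary simplification. It finds the largest branches with no sequenced member and keeps one representative from each branch that is available in the culture collection.

```python
def leaves(node):
    """List the leaf names under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def unsequenced_clades(node, sequenced):
    """Return the largest subtrees that contain no sequenced organism."""
    members = leaves(node)
    if not any(m in sequenced for m in members):
        return [members]  # whole branch is unsequenced: report it as one clade
    if isinstance(node, str):
        return []  # a single organism that already has a genome
    return [clade for child in node
            for clade in unsequenced_clades(child, sequenced)]


def pick_candidates(tree, sequenced, in_collection):
    """Pick one culturable representative from each unsequenced branch."""
    picks = []
    for clade in unsequenced_clades(tree, sequenced):
        available = [m for m in clade if m in in_collection]
        if available:
            picks.append(available[0])
    return picks


# Hypothetical tree: s* already have genomes (or projects in progress), u* do not.
tree = ((("s1", "s2"), ("u1", "u2")), ("u3", ("s3", "u4")))
sequenced = {"s1", "s2", "s3"}
in_collection = {"u2", "u4"}  # strains assumed to be available from the collection

candidates = pick_candidates(tree, sequenced, in_collection)
print(candidates)
```

In this toy case the branch holding only u3 is dropped because no strain from it is available, which mirrors the practical constraint of needing a source of DNA.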

      And I became the chief cheerleader and administrator of the project, in part since Phil and Nikos were so busy with their other things at JGI. And though I was not always on the ball, the project moved forward and we started to get genomes sequenced using the full strength of the JGI as a genome center. The finishing teams at JGI worked diligently on finishing as many of the genomes as possible. And Nikos’ team at JGI made sure that the genomes were annotated. And I helped make sure that the data release policies were broadly open (which everyone at JGI supported). And after many false starts with papers on the project that were way way way too cumbersome and big, with some kicks in the pants from the director of JGI Eddy Rubin who was getting anxious about the project, we turned out the GEBA paper that was published today in Nature.

      You might ask, why, as a PLoS official and PLoS cheerleader, we ended up having a paper in Nature? Well, in the end, though I am senior author on the paper, the total contribution to the work mostly came from people at JGI who did not work for me but instead worked with me on this great project. And we took some votes and had some discussions and in the end, despite my lobbying to send it to PLoS Biology, submitting it to Nature was the group decision. I supported this decision in part due to the fact that Nature uses a Creative Commons license for genome papers. But I also supported it because in the end, this was a collaboration involving many many many people and in such projects everyone has to compromise here and there. Now mind you, I am not sad to have a paper in Nature. But I would personally have preferred to have it in a journal that was fully open access, not just occasionally open like Nature.

      Now I note, there were a million other things that went on associated with the GEBA project. Some of which I was not even involved in in any way. I will try to write about some of these another time, but this post is already way way way too long. So I am going to just stop here and add that I have been honored and lucky to work with people like Phil, Nikos, Hans-Peter, and others on this project and to have the people at the JGI work so hard on the background parts of this project. Thanks to all of them and to the people at DSMZ and in my lab who helped out and to the DOE for funding this work (as well as the Gordon and Betty Moore Foundation, who funded some of the work from my lab on analysis of these genomes). And last but not least, thanks to the Director of JGI Eddy Rubin, for supporting this project and for being patient with it and for kicking us in the pants when we needed to get moving on getting a paper out.

      Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462 (7276), 1056-1060. DOI: 10.1038/nature08656