One of my new favorite things: paleovirology

Just a quick post here about a paper that came out about a month or so ago: PLoS Biology: Genomic Fossils Calibrate the Long-Term Evolution of Hepadnaviruses

This paper, by Clément Gilbert and Cédric Feschotte, is quite cool.  In it they describe their work on “Paleovirology” where they look for viruses that have “endogenized” by inserting into the genome of some host species.  This endogenization is important in particular when the endogenous form becomes inactive and thus, in essence, trapped in the genome.  This in turn is important because many viruses evolve so rapidly when they are “free” that it is very hard to reconstruct their ancient history through comparative analysis.  But the endogenized viruses serve in essence as a molecular “fossil record” that aids in the comparison and phylogenetic analysis of various families of viruses.  As we get more and more genomes, this searching for and analysis of endogenous viruses will get much better.

Anyway, in the paper they report on endogenous viruses in the Zebra Finch genome that are in the Hepadnaviridae family.  Here is their summary:

Paleovirology is the study of ancient viruses and the way they have shaped the innate immune system of their hosts over millions of years. One way to reconstruct the deep evolution of viruses is to search for viral sequences “fossilized” at different evolutionary time points in the genome of their hosts. Besides retroviruses, few virus families are known to have deposited molecular relics in their host’s genomes. Here we report on the discovery of multiple fragments of viruses belonging to the Hepadnaviridae family (which includes the human hepatitis B viruses) fossilized in the genome of the zebra finch. We show that some of these fragments infiltrated the germline genome of passerine birds more than 19 million years ago, which implies that hepadnaviruses are much older than previously thought. Based on this age, we can infer a long-term avian hepadnavirus substitution rate, which is a 1,000-fold slower than all short-term substitution rates calculated based on extant hepadnavirus sequences. These results call for a reevaluation of the long-term evolution of Hepadnaviridae, and indicate that some exogenous hepadnaviruses may still be circulating today in various passerine birds.

Figure 4. Summary of the evolutionary scenario inferred in this study.

It is an interesting paper and worth a look for those who have any interest in viral evolution. And I am becoming more and more fascinated by “Paleovirology” these days so I thought I would just post about this article here.  And I guess I am not alone in this opinion that the article is interesting (though I am late).  Here is some coverage of their paper:

Gilbert, C., & Feschotte, C. (2010). Genomic Fossils Calibrate the Long-Term Evolution of Hepadnaviruses PLoS Biology, 8 (9) DOI: 10.1371/journal.pbio.1000495

Lack of neutrality in bacteria and where pseudogenes go when they die

ResearchBlogging.org

Pseudogenes, which are in essence regions of the genome that used to be genes but are no longer able to produce a functional unit, have long been considered to be models of the genetic equivalent of Switzerland’s neutrality. With this assumption of neutrality in hand, researchers have used studies of pseudogenes to better understand what happens to DNA when it is not visible to any form of natural selection. That is, pseudogenes have been thought to be neither harmful (as in, they are not under negative selection) nor helpful (i.e., they are not under positive selection).

And from this assumption we have supposedly learned about mutation rates and patterns (because if they are neutral then the changes in pseudogenes should be reflective of mutational processes, not selection) as well as all sorts of other features of genome evolution.
Over the years, some have challenged the assumption of neutrality of pseudogenes (e.g., see here), much as many have questioned whether Switzerland is really neutral. But overall, the feeling that pseudogenes were mostly neutral seems to have stuck. However, that may change a bit with a new paper from Chih-Horng Kuo and Howard Ochman in PLoS Genetics (PLoS Genetics: The Extinction Dynamics of Bacterial Pseudogenes).
In their paper they report (this is their author summary):

Pseudogenes have traditionally been viewed as evolving in a strictly neutral manner. In bacteria, however, pseudogenes are deleted rapidly from genomes, suggesting that their presence is somehow deleterious. The distribution of pseudogenes among sequenced strains of Salmonella indicates that removal of many of these apparently functionless regions is attributable to their deleterious effects in cell fitness, suggesting that a sizeable fraction of pseudogenes are under selection.

Basically, what they did was the following:
1. Compare Salmonella genomes. Identify putative pseudogenes and trace their evolution onto a phylogeny of the species.
Figure 1. Distribution of pseudogenes among Salmonella genomes.
The phylogenetic tree was inferred from 2,898 single-copy genes shared by all five S. enterica subsp. enterica strains and the outgroup S. enterica subsp. arizonae.

doi:10.1371/journal.pgen.1001050.g001


2. Carry out a variety of analyses of the pseudogenes such as
  • looking at ratios of Ka/Ks (this is in essence a ratio of amino acid changes – aka non synonymous substitutions to “silent” synonymous changes which occur when the DNA sequence changes but the same amino acid is encoded).
  • examining the types and frequencies of gene inactivating mutations
3. Then they looked at the “ages” of pseudogenes – with age being estimated by the position in the tree in which the pseudogenes appear to have arisen.
4. Finally they examined the age class distribution of pseudogenes as well as whether there were other differences between pseudogenes of different ages. And what they found was inconsistent with a neutral model. Instead, what they conclude is that something is making it advantageous to delete pseudogenes more rapidly than one might expect.
What explains this? After testing multiple possibilities the authors conclude that there is some negative selection against pseudogenes (or I guess positive selection for deletion of pseudogenes).
They conclude by suggesting this is likely to be pervasive across all bacteria and even in archaea. They furthermore make a connection to possible selection on intron size in eukaryotes. Anyway – the paper seems quite interesting and worth a read. Still pondering what it all means, so I would welcome comments.
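
For anyone who wants a feel for the Ka/Ks ratio mentioned above, here is a minimal Python sketch. It is a toy illustration and not the method used in the paper: it loosely follows the counting idea behind the Nei-Gojobori approach, but it skips codons that differ at more than one position, skips stop codons, and applies no correction for multiple hits, so it reports simple proportions (pN and pS) rather than corrected Ka and Ks. The function names and the example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in a codon (between 0 and 3).

    Each of the 9 possible single-base changes contributes 1/3 of a site
    if it leaves the encoded amino acid unchanged.
    """
    aa = CODON_TABLE[codon]
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == aa:
                sites += 1 / 3
    return sites


def pn_ps(seq1, seq2):
    """Return (pN, pS) for two aligned, in-frame coding sequences.

    Simplifications (assumptions of this sketch): codons with gaps or
    ambiguous bases, stop codons, and codons differing at more than one
    position are skipped entirely.
    """
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if "*" in (CODON_TABLE[c1], CODON_TABLE[c2]):
            continue
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff > 1:
            continue
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if n_diff == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    # Raises ZeroDivisionError if no usable sites were found.
    return nonsyn_diffs / nonsyn_sites, syn_diffs / syn_sites


# Made-up example: one synonymous change (GCT -> GCC) and one
# nonsynonymous change (AAA -> AAT).
pn, ps = pn_ps("ATGGCTAAA", "ATGGCCAAT")
print("pN/pS =", pn / ps)
```

A ratio near 1 is what one would expect if amino acid changes are neutral; a ratio well below 1 suggests that amino acid changes are being weeded out by selection.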

Kuo, C., & Ochman, H. (2010). The Extinction Dynamics of Bacterial Pseudogenes PLoS Genetics, 6 (8) DOI: 10.1371/journal.pgen.1001050

Human genome project oversold? Sure, but let's not undersell basic science

Well, the piling on the human genome project continues, it seems at an accelerating pace.  I think most of this comes from the fact that we are in the range of the 10 year anniversary right now.   Here are some examples of recent stories suggesting the human genome project (or projects, if you count the public effort and Craig Venter’s effort as separate) have had little benefit:

  • 7/31/10: The Human Genome Project: 10 Years Later, Progress but Still a Puzzle – WNYC. Interesting piece by Sarah Kate Kramer discussing the limited clinical value of the HGP.  Includes some criticisms of personalized genomic medicine. 
  • 7/29/10: Spiegel interview with Craig Venter with the headline “We have learned nothing from the genome”.  Has lots of interesting tidbits.  Love the Venter line “Well, nobody likes to be beaten — by superior intelligence, planning and technology. That gets people upset.”  But I note Craig emphasizes the basic science value of human genome data.
  • 7/6/10: Public Radio mini story about Mike Mandel’s article on the failure of the human genome project.
  • 6/12/10: Nick Wade’s NY Times article on “A decade later, genetic map yields few new cures“.  In this Wade discusses many of the issues with both the sequencing of the human genome and some of the spinoff projects (and also butchers some evolutionary biology for which I gave him a twisted tree of life award). 
These are but a small sampling of the many many blogs, articles, and other reports that either directly state or suggest that much of the money spent on the human genome project was a waste.

Certainly, contrary to the suggestion of some of these articles, there have been some practical benefits that have come directly or indirectly from human genome sequencing.  But equally certainly, these critiques have a kernel of truth to them in that the practical benefits have been few and far between.

Normally, one would not expect too many direct practical benefits to come from this kind of science project.  But alas, the problem here is that many of the key players (e.g., Eric Lander, Francis Collins, Craig Venter) in the sequencing of the human genome(s) oversold the potential benefits that could come from the sequencing.  In a way, it was their job to oversell the sequencing, since each was a cheerleader in ways for getting others to do a lot of work.

Many people knew at the time that this overselling was going on.  It was talked about extensively at various genome conferences and even occasionally in the press and scientific literature (boy do I wish I had had a blog then, because I was one of those people at conferences practically begging people to not oversell the benefits of the project – I now even give out an “overselling genomics award” on my blog).  The cautionary voices were mainly saying that there was no need to oversell the project and that we should stick to the benefits of “knowing” ourselves and not guess about how it will lead to immediate cures for diseases.  And many said “If you oversell this now, it will come back to bite you.”

And thus it is not surprising to me that there is somewhat of a backlash now.  But there is a very dark side to the backlash that has potential to hurt science for many years to come.  If there is a need in the future for large scale science / medical projects, I can guarantee that some critics will step up and say things like “Well the war on cancer failed.  And the human genome project failed.  Why should we trust you now?”

The problem here is that the human genome project should never have been sold as a means to a series of practical ends.  It should have been sold as a massive basic science project, much like going to the moon or building a giant linear accelerator.  That is, the human genome project was, and still really is, about knowledge.  It is about knowing ourselves.  It has enormous potential benefits in all sorts of areas, like human medicine.  It should greatly aid and abet studies of human biology and genetics and disease.  But given that benefits that come from such studies are impossible to predict, the human genome project should have been presented in a different way.  We need to discuss more in public why basic science is important even if one cannot predict what the benefits are.

In many ways, this is very much like the “war on cancer” which some have argued failed because we still have cancer killing a lot of people.  But this is off base because in fact the war on cancer has provided us with an incredible baseline of information about the biology of cancer.  We need to do a better job in all of these cases of defending the need for knowledge, and discussing how fighting cancer and curing diseases is not the same as building a big bridge or road.

The best person discussing this issue for the last ten or so years in my opinion has been Harold Varmus, who was once the head of NIH and is now the new director of the National Cancer Institute.  I have heard him repeatedly defending the “war on cancer” in terms of its basic science benefits.  For example see his comments on Science Friday 1/30/2009 and 7/16/2010.  There just have not been too many people doing a good job of this with genomics.  Venter and Collins have been OK here and there.  But we need more.

On a related note, we probably should have more discussion about how the money spent on the genome project and the war on cancer pales in comparison to money we spend on other things (e.g., interest on the national debt, wars, etc.) but perhaps that is a side discussion.

Most importantly, we need to bring out to the public more of a discussion of the benefits from basic science. Here are some useful resources if you want to try and help:

I also encourage people to look at the National Academy of Sciences report A New Biology for the 21st Century: Ensuring the United States Leads the Coming Biology Revolution.  I note, I was one of the coauthors.  You can download the PDF of the whole document after giving your email address.
I am going to start a new series here on this blog called “Benefits of basic science” where I will be discussing these issues.  I encourage others out there to also bring more to the forefront discussions of the need for basic science.

——————–
UPDATE

Also see

Testing, testing – why we need more testing like this in genomic informatics & annotation methods

Just got an announcement regarding this challenge:

Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations | Automated Function Prediction 2011 July 15-16 2011, Vienna, Austria

Here is a description:

CAFA is a community-driven effort. We call upon computational function prediction groups to predict the function of a set of proteins whose true function is sequestered. At the meeting, we will reveal the functions, and discuss the predictions. The CAFA challenge goals are to foster a discussion between annotators, predictors and experimentalists about methodology as quality of functional predictions, as well as the methodology of assessing those predictions. Registration for CAFA starts July 15, 2010 and the CAFA challenge will take place September 15, 2010 through January 15, 2011. See here for more details on how you can enroll in CAFA.

This is near and dear to my heart as I have been working on methods to predict gene function from sequence for some 15 years now.  My first paper on this was in 1995 in which I showed that for genes in multigene families, phylogenetic trees of the gene family could help in predicting functions of uncharacterized members of the gene family.  More specifically, I suggested that the position of an uncharacterized gene in a gene tree relative to characterized genes could be used to predict its function.  I did this for one family in particular – the SNF2 family – but argued that it could be applied to other families.  (I think perhaps it was the first time someone had made this specific argument about using trees to predict function, but am not sure)
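
The basic idea above can be sketched in a few lines of Python. This is only an illustration with a made-up tree and made-up function labels, not the analysis from the 1995 paper. The tree is written as nested tuples, and an uncharacterized gene is assigned a function only if all the characterized genes in the smallest clade around it agree. The gene names and the `predict_function` helper are hypothetical.

```python
def leaves(tree):
    """Return all leaf names in a tree written as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def clades_containing(tree, query):
    """Return the clades that contain `query`, from smallest to largest."""
    if isinstance(tree, str):
        return [tree] if tree == query else []
    for child in tree:
        path = clades_containing(child, query)
        if path:
            return path + [tree]
    return []


def predict_function(tree, query, known):
    """Predict the function of `query` from its position in the gene tree.

    Walk outward from the query gene. At the first clade that contains
    any characterized genes, return their function if they all agree,
    or None if they disagree (an ambiguous placement).
    """
    for clade in clades_containing(tree, query):
        functions = {known[g] for g in leaves(clade) if g in known and g != query}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return None
    return None


# Hypothetical gene family: two subfamilies with different functions.
gene_tree = ((("geneA", "mystery_gene"), "geneB"), ("geneC", "geneD"))
characterized = {
    "geneA": "function X",
    "geneB": "function X",
    "geneC": "function Y",
    "geneD": "function Y",
}
print(predict_function(gene_tree, "mystery_gene", characterized))
```

Note that this toy version uses only tree topology. It ignores branch lengths, support values, and the distinction between duplication and speciation events, all of which matter in a real analysis.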

I then formalized this idea with a few papers (e.g., here and here) describing a “phylogenomic” approach to predicting function (alas, this is when I invented my first omics word).  I have continued to work on functional prediction methods ever since.  When I was at TIGR for eight years I did this both in my own research and helped others with their functional predictions.  I firmly believe that evolutionary approaches are critical in such functional prediction and have laid this out in a series of talks and papers (e.g., see this more recent one).

Anyway, enough about me.  I can argue all I want about how brilliant I am and about how evolutionary methods are the best approach.  But arguing is alas not science.  What we need are tests and experiments.  And that is where things like CAFA come in.  In CAFA one can test how well various functional prediction methods work.  And the people involved in CAFA (including organizers Iddo Friedberg, Michal Linial, and Predrag Radivojac and others such as Amos Bairoch, Sean Mooney, Patricia Babbitt, Steven Brenner, Christine Orengo and Burkhard Rost) are to be commended for putting this together because we do not have a lot of these activities and need more in all aspects of genomics (and metagenomics too).  Others have discussed doing tests of functional prediction methods before, but I am not sure if any have happened per se.

Have a favorite functional prediction method?  Enter it in the competition or give a talk on it.  And if you are feeling inspired, organize a similar activity in your area of science – testing is a good thing.

See also Iddo Friedberg’s post about this

Holy lateral transfer batman; amazing story on fungal to aphid transfer from Nancy Moran

As many know, I generally do not write a lot about papers in non open access journals because I like readers to be able to access all the papers which I write about. But this is one of the exceptions to my normal rule. An amazing paper was published a few days ago in Science by Nancy Moran and Tyler Jarvik. Lateral Transfer of Genes from Fungi Underlies Carotenoid Production in Aphids — Moran and Jarvik 328 (5978): 624 — Science
I first found out about this from Ed Yong’s blog post here (just a note – his Not Exactly Rocket Science is such a frigging incredible blog). He really does the whole story on this so I am just posting a bit here.
Anyway, the Moran and Jarvik paper focuses on genes in the aphid genome that encode enzymes for carotenoid synthesis. These enzymes are involved in red and/or green coloring seen in the pea aphids. Recently the pea aphid genome was sequenced (a paper about this was published in PLoS Biology) and it was analysis of the genome data that helped lead Moran and Jarvik to the study reported in the recent issue of Science.
In their study they report a detailed evolutionary and phylogenetic analysis of the carotenoid synthesis genes found in the aphid genome and show quite convincingly that these genes do not appear to be of “normal” descent. That is, they seem to have an ancestry separate from many of the “normal” animal genes in the genome. Instead, these genes are related to genes from fungi. In fact, these genes are embedded in an evolutionary sense, in a group of genes which are all from fungi and thus Moran and Jarvik conclude the most likely explanation is that some time in relatively recent pea aphid evolutionary history, these genes were acquired from some fungus.
About to have some eye drops put in my eyes so gotta go for now, but just wanted to get something out there about this fascinating work. For more on this story – there is lots out there, such as the following:

Moran, N., & Jarvik, T. (2010). Lateral Transfer of Genes from Fungi Underlies Carotenoid Production in Aphids Science, 328 (5978), 624-627 DOI: 10.1126/science.1187113

. (2010). Genome Sequence of the Pea Aphid Acyrthosiphon pisum PLoS Biology, 8 (2) DOI: 10.1371/journal.pbio.1000313


JGI User Mtg Day3 notes (coming up Rita Colwell, ex head of NSF)

Here are links to the Friendfeed Notes for today

http://friendfeed.com/jgi2010/7104a5aa/rita-colwell-university-of-maryland-solving?embed=1

http://friendfeed.com/jgi2010/f73db47f/joseph-noel-salk-institute-substrate?embed=1

http://friendfeed.com/jgi2010/1d5588e4/adrian-tsang-concordia-university-fungal?embed=1

http://friendfeed.com/jgi2010/4b31edd3/tanya-woyke-jgi-genomic-sequencing-of-single?embed=1

Genome Sequencer FLX Bay Area Regional User Group Meeting

Just got this email – may be of interest to some:

Genome Sequencer FLX Bay Area Regional User Group Meeting

Roche 454 Life Sciences invites you to the Bay Area Regional Genome Sequencer FLX User Group meeting which will be held at Roche Diagnostics in Pleasanton, CA on March 8th and 9th.

We will kickoff the meeting the afternoon of March 8 with interactive sessions with our BioInformatics Specialist Teri Mueller and Regional Applications Consultant Shamali Roy. Bring any questions or ideas you may want to address. Teri will present on the most recent upgrades to our Data Analysis software and Shamali will be available for experimental design advice. This will be a user driven question and answer event scheduled from 1:00-4:00 pm.

Tuesday will feature local scientists presenting their 454 Genome Sequencer FLX work for a variety of applications. Presentations will begin at 10:00 am and conclude at 4:00 pm.

Speakers for the event include

Robert Shaffer, MD – Associate Professor of Medicine and Pathology, Stanford University

Feng Chen, PhD – Group Lead, Technology and Applications, US DOE Joint Genome Institute

Matt Ashby, PhD – President and Chief Scientific Officer, Taxon Biosciences

Henry Erlich, PhD – Vice President Discovery Research, Roche Molecular Systems

This meeting is designed to allow GS FLX users to share experiences and knowledge about the platform, creating a community of users where tips and tools can be shared. Other than a brief introduction by Roche 454, this will be a Science for Scientists event. Please feel free to share this invitation with colleagues.

Come and network with other researchers using Roche 454 technology as well as learn tips ranging from sample management to base calling to whole genome analysis, and much more. This event is free of charge and is open to everyone with an interest in using this exciting technology for accelerating your research and discovery. Lunch will be provided on March 9.
Space is limited and registration is required. To register for the event please RSVP to either courtney.brady@roche.com or goli.shariat@roche.com Please indicate if you will be attending March 8th or 9th or both days.

The address for the event is Roche Diagnostics, 4300 Hacienda Drive, Pleasanton, CA 94588. There is a BART station nearby and shuttles to Roche can be arranged.

Story behind the science: #PLoS Genetics "Evolutionary mirages" paper


So there is this cool new paper out in PLoS Genetics: Evolutionary Mirages: Selection on Binding Site Composition Creates the Illusion of Conserved Grammars in Drosophila Enhancers, and I have wanted to write about it for a week or so. You see, the paper is about something I have been interested in for most of my career – how the particular processes by which mutations occur can sometimes be biased (i.e., some types of mutations are more common than others) and that these biases can create highly ordered patterns in genomes and in turn that observation of these ordered patterns can sometimes be misinterpreted as being the result of adaptation. Mistaken claims of adaptation in genomics are a favorite topic of mine – and led me to create (with tongue in cheek) a new omics word – Adaptationomics.

Anyway – so I really really like this paper. But there is a wee bit of a problem in writing about it. You see, it is by my brother, Michael Eisen, a Prof. at UC Berkeley (and a student in his lab, Richard Lusk). And, well, I don’t want to say anything wrong or stupid about the paper since, well, my brother will be pissed off. And so I have not written about it yet. But then I realized the best way to write about this one is to simply ask my brother for the “Story behind the science” for the paper, as I have been doing for some other recent papers.

If you want a summary of the paper, here it is in their own words:

Authors summary: Because mutation is a random process, most biologists assume that apparently non-random features of genome sequences must be the result of natural selection acting to create and preserve them. Where this is true, genome sequences provide a powerful means to infer aspects of molecular, cellular, and organismal biology from the signatures of selection they have left behind. However, recent analyses have shown that many aspects of genome structure and organization that have traditionally been attributed to selection can often arise from random processes. Several groups—including ours—studying the sequences that specify when and where genes should be produced have identified common, seemingly conserved, architectural features, based on which we have proposed new models for the activity of the complex molecular machines that regulate gene expression. However, in the work described here we simulate the evolution of these regulatory sequences and show that many of the features that we and others have identified can arise as a byproduct of random mutational processes and selection for other properties. This calls into question many conclusions of comparative genome analysis, and more generally highlights what Michael Lynch has called the “frailty of adaptive hypotheses” for the origins of complex genomic structures.

Conclusions: Lynch has eloquently argued that biologists are often too quick to assume that organismal and genomic complexity must arise from selection for complex structures and too slow to adopt non-adaptive hypotheses. Our results lend additional support to this view, and extend it to show that indirect and non-adaptive forces can not only produce structure, but also create an illusion that this structure is being conserved. We do not doubt that many aspects of transcriptional regulation constrain the location of transcription factor binding sites within enhancers. Indeed a large body of experimental evidence supports this notion, and we remain committed to identifying and characterizing these constraints. But if this process is to be fueled by comparative sequence analysis, as we believe it must be, it is essential that we give careful consideration to the neutral and indirect forces that we now know can produce evolutionary mirages of structure and function.

I must say I love the title lead in “Evolutionary mirages” which is another but much better way of saying “Adaptationism is a bad thing”.

Anyway, before I get in any more trouble, here are some words about the paper from the Senior Author, Michael Eisen, my brother. Questions by me (I know, not very creative ones – but they will have to do):

1. Why did you do this work?

This paper started out as a control. My lab is interested in understanding how the enhancers that control gene expression work – focusing on those that control early development in Drosophila. In 2008, we published a paper showing that when we put enhancers from a distantly related family of flies into Drosophila melanogaster embryos, they drive patterns of expression that are identical to the endogenous D. melanogaster enhancers, even though they have almost no conservation of primary DNA sequence. But since they have the same function, they must have something in common – and so we compared the configurations of transcription factor binding sites in orthologous enhancers across different evolutionary timescales looking for something they shared.

What we found is that binding sites in all of these enhancers occur in clusters. They are closer to each other than one would expect if they were scattered randomly in the ~1,000 bp of an enhancer. And, what’s more, sites that were close to each other were far more likely to be conserved. Surely, we thought, this could be no accident. So we proposed that enhancers are organized into compact clusters of sites for one or more factors – and that these “mini modules” are the primary unit of enhancer function.

But as we worked to extend these analyses to whole genomes, we sought a more rigorous, quantitative assessment of just how improbable different levels of binding site clustering were. Like pretty much everyone in the field, we had used a null model in which binding sites were scattered randomly in an enhancer. But, I’ve been working with genomes long enough to know that nothing is ever truly random – and that all kinds of adaptive and non-adaptive processes create patterns in genome sequences that confound simple analyses. I wanted to come up with a null model for the distribution of sites within an enhancer that was more realistic.

To do this I turned to my graduate student Rich Lusk, a card-carrying population geneticist trained at the University of Chicago. Rich was proud of his status as one of the few members of the lab who didn’t work on flies – but I convinced him to put aside the abstract models of binding site evolution in yeast and work on developing a real null model for our studies of enhancer evolution.

The idea was to simulate enhancers evolving without any constraint on the organization of transcription factor binding sites they contain, and to see what happens. But this did not mean letting enhancers evolve neutrally – their extreme functional conservation demonstrates that they are under fairly strong constraint. Since it is pretty clear that these enhancers are responding to the same transcription factors in all of these species, Rich’s simulations required that enhancers maintain their binding site composition – but placed no constraints on how the sites were organized relative to each other.

And what we found was striking. Even with no explicit selection on binding site organization – these evolved enhancers had lots of structure! Binding sites were clustered together, and, the closer together sites were, the more conserved they were, just like they were in real enhancers. It made us realize pretty quickly that the patterns we had latched onto – and which many other people were describing in different systems – might not be an evolutionary signature of constraint on the organization of sites within enhancers, but simply a byproduct of selection on binding site composition. If you want details, read the paper! But this has radically altered the way that we look at enhancer evolution.
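
For readers who think better in code, here is a toy Python sketch of the kind of simulation described in this answer: a sequence mutates at random, and a mutation is rejected only if it would drop the number of binding sites below a required minimum. Nothing constrains where the sites sit relative to each other. This is not the model from the paper. The motif, the sequence length, the required number of sites, and the restriction to single-base substitutions are all assumptions made for illustration.

```python
import random

MOTIF = "TGAT"   # hypothetical binding-site motif
MIN_SITES = 4    # hypothetical number of sites selection requires


def site_positions(seq, motif=MOTIF):
    """Start positions of every (possibly overlapping) motif match."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]


def starting_enhancer(length, rng):
    """Random sequence with MIN_SITES evenly spaced copies of the motif."""
    seq = [rng.choice("ACGT") for _ in range(length)]
    spacing = length // MIN_SITES
    for k in range(MIN_SITES):
        start = k * spacing
        seq[start:start + len(MOTIF)] = MOTIF
    return "".join(seq)


def evolve(seq, steps, rng):
    """Apply random substitutions, keeping only those that preserve
    binding site composition (at least MIN_SITES motif matches)."""
    seq = list(seq)
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice([b for b in "ACGT" if b != old])
        if len(site_positions("".join(seq))) < MIN_SITES:
            seq[pos] = old  # selection on composition rejects this change
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = starting_enhancer(200, rng)
descendant = evolve(ancestor, steps=5000, rng=rng)
print("ancestor sites:  ", site_positions(ancestor))
print("descendant sites:", site_positions(descendant))
```

Running many such replicates and comparing the spacing of sites against a purely random scattering is the sort of comparison a null model is for. Whether a toy this simple reproduces the clustering reported in the paper is not something the sketch establishes.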

2. How did you come up with the title?

Rich and I were writing the paper, and we had some really long, hideous, boring title. In writing the paper, the idea that things are not always what they appear to be was at the forefront of my mind. I was thinking about how desperate we and other people in the field were to figure out how enhancers work – it’s a vexing problem that has defied decades of work – and how we all hoped that evolutionary analysis was going to rescue us – and how quickly and eagerly we latched on to the first signs of a signal – and how that was just like a mirage you see in the desert….

3. Any interesting background?

(see 1)

4. When did the work start?

About a year ago. We had been thinking about this for a while, but only when Rich focused on it did things get rolling.

5. Why PLoS Genetics? Did PLoS Biology reject it?

PLoS Genetics was our first choice. PG has become the premier journal for evolutionary genetics – it routinely publishes the most interesting and important work in the field, and everyone reads it. While every paper I’ve sent there has been heavily scrutinized, the editorial process has been fair (though sometimes agonizingly slow….), and each review has been thoughtful and many (including in this case) helped to vastly improve the paper.

Lusk, R., & Eisen, M. (2010). Evolutionary Mirages: Selection on Binding Site Composition Creates the Illusion of Conserved Grammars in Drosophila Enhancers PLoS Genetics, 6 (1) DOI: 10.1371/journal.pgen.1000829


Wanted: Feedback on Importance of Finishing (Microbial) Genomes

To all

I am writing because I am working on a project to evaluate the importance of finishing microbial genomes. I know there has been lots of talk about this out there on the web and in papers, etc., but I think a fresh discussion is useful. To get people up to speed, below is a summary of the issue as I see it.

  1. Shotgun sequencing: Genome sequencing relies generally on the shotgun method at the beginning of a project where DNA fragments from an organism of interest are sequenced in a highly random manner.
  2. Assembly: After shotgun sequencing, the genome is assembled as best as possible into larger pieces (called contigs) and ordered sets of contigs (called scaffolds). All of this put together can be called an “assembly”.
  3. Gaps: After the assembly phase, there are almost always gaps in the assembly. These generally come in two forms:
    • sequencing gaps (where we know two contigs go together in some orientation but where we do not know the sequence of the DNA in between the contigs)
    • physical gaps (where we have sets of scaffolds but do not know how they connect to each other).
  4. Quality: After the assembly phase, different components of the assembly can have different “qualities” where, for example, some sections are somewhat ambiguous and others are highly reliable.
  5. Finishing: Using any combination of laboratory, computational and other analyses one can both fill in gaps in the assembly and improve the quality of the assembly. This can generally be called “finishing.”
  6. Quality of final product: Depending on the end quality of the assembly we could assign it to one of a few categories of “completeness” as outlined in a paper by Patrick Chain et al. In essence, you can consider this post to be a follow up to their paper and their work.
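To make the vocabulary above concrete, here is a minimal sketch of how contigs, scaffolds and the two kinds of gaps relate to each other. The class and function names are hypothetical, and the physical gap count assumes all scaffolds belong to a single linear molecule, which is a simplification (most microbial chromosomes are circular and many genomes have plasmids).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Contig:
    name: str
    length: int  # number of bases in this contiguous stretch of sequence


@dataclass
class Scaffold:
    name: str
    contigs: List[Contig] = field(default_factory=list)  # ordered contigs

    def sequencing_gaps(self) -> int:
        # Each adjacent pair of ordered contigs is separated by one
        # sequencing gap: we know they go together, but not what lies between.
        return max(len(self.contigs) - 1, 0)


def summarize(scaffolds: List[Scaffold]) -> Dict[str, int]:
    """Count gaps in an assembly (assumption: one linear molecule)."""
    return {
        "scaffolds": len(scaffolds),
        "sequencing_gaps": sum(s.sequencing_gaps() for s in scaffolds),
        # Unknown joins between scaffolds, under the single-molecule assumption.
        "physical_gaps": max(len(scaffolds) - 1, 0),
        "assembled_bases": sum(c.length for s in scaffolds for c in s.contigs),
    }


assembly = [
    Scaffold("s1", [Contig("c1", 1000), Contig("c2", 500)]),
    Scaffold("s2", [Contig("c3", 200)]),
]
print(summarize(assembly))
```

In this toy picture, finishing is whatever drives both gap counts toward zero while also raising the per-base quality, which the sketch does not model.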
We plan to try to measure what one gains by the finishing steps. We need to know this because we would like to make intelligent decisions about how to allocate resources. If one gains a lot from finishing then it would make sense to allocate significant resources to it. I note, I and some colleagues wrote a paper about this issue “The value of complete microbial genome sequencing (You get what you pay for)” that was published in 2002. This is without a doubt not the only discussion of the topic but I just wanted to point out I have been involved in this debate before. Despite that, I think we simply do not know right now what the benefits might be in the new sequencing landscape.
——————————————
So the question I am asking here is:

What do people think are the potential benefits that could come from finishing?

——————————————

Here are some possible answers to get the discussion going:
  1. Gene discovery (e.g., there may be interesting/important genes in missing/low quality data)
  2. Esthetics of completeness (as in, it just feels better to have a finished genome)
  3. Improved analysis of genome organization (in particular from having contigs oriented correctly)
Also – I note there has been some discussion of this for animals, plants etc. (e.g., see recent paper by Eric Green and others on vertebrates). Many of the issues are similar but they are different enough that I think a microbe focused discussion is useful.
Other links of interest:

Blakesley, R., Hansen, N., Gupta, J., McDowell, J., Maskeri, B., Barnabas, B., Brooks, S., Coleman, H., Haghighi, P., Ho, S., Schandler, K., Stantripop, S., Vogt, J., Thomas, P., Comparative Sequencing Program, N., Bouffard, G., & Green, E. (2010). Effort required to finish shotgun-generated genome sequences differs significantly among vertebrates BMC Genomics, 11 (1) DOI: 10.1186/1471-2164-11-21

Fraser, C., Eisen, J., Nelson, K., Paulsen, I., & Salzberg, S. (2002). The Value of Complete Microbial Genome Sequencing (You Get What You Pay For) Journal of Bacteriology, 184 (23), 6403-6405 DOI: 10.1128/JB.184.23.6403-6405.2002

Chain, P., et al. (2009). Genome Project Standards in a New Era of Sequencing Science, 326 (5950), 236-237 DOI: 10.1126/science.1180614

Friendfeed discussion of this post:

http://friendfeed.com/treeoflife/4999d16e/wanted-feedback-on-importance-of-finishing?embed=1

Story Behind the Nature Paper on ‘A phylogeny driven genomic encyclopedia of bacteria & archaea’ #genomics #evolution

Today is a fun day for me. A paper on which I am the senior author is being published in Nature (yes, the Academic Editor in Chief of PLoS Biology is publishing a paper in Nature, more on that below ..). This paper, entitled, “A phylogeny driven genomic encyclopedia of bacteria and archaea” represents a culmination of years of work by many people from multiple institutions. Today in this blog I am going to do my best to tell the story behind the paper – about the people and the process and a little bit about the science.

First, a brief bit about the science in the paper. In this paper, we (mostly people at the Joint Genome Institute, where I have an Adjunct Appointment, but also people in my lab at UC Davis and at the DSMZ culture collection) did a relatively simple thing – we started with the rRNA tree of life as a guide. Then we identified branches in the bacterial and archaeal portions of this tree where there were no genome sequences available (or in progress) (this was done mostly by Phil Hugenholtz, Dongying Wu and Nikos Kyrpides). Next we searched for representatives of these “unsequenced” branches in the DSMZ culture collection (a collection of bacteria and archaea that can be grown in the lab). And we identified in total some 200 of these. And then the DSMZ (under the direction of Hans-Peter Klenk) grew these organisms and sent the DNA to the Joint Genome Institute. And then JGI turned on their genome sequencing muscle and sequenced the genomes of the organisms in the DNA samples. And finally, we spent a good deal of time then analyzing the data asking a pretty simple question – are there any general benefits that come from this “phylogeny driven” approach to sequencing genomes compared to what one might find with sequencing just any random genome (after all, any genome sequence could have some value)? The paper describes what we found, which is that there are in fact many benefits that come from sequencing genomes from branches in the tree for which genomes are not available.

More on the details of the science below. But first, I want to note that this paper was truly an amazing team effort, with all sorts of people from the JGI in particular, going above and beyond the call of duty to make sure it happened and worked well. And the Department of Energy has been truly phenomenal in my opinion in supporting this project which in the end is not explicitly about “energy” per se but is really about providing a reference set of genomes that should improve the value of all microbial genome data.

Anyway, now for the story behind the story. And be prepared, because this is a bit long. But I think it is important to place this work in a bigger context both in terms of my background as well as some of the background of other people in the project. If you can’t wait for more on the GEBA project then perhaps you should go to some of these links:

And I will post more links as they come up. Below what I try to provide is some of the story behind the story:

My personal interest in applied uses of phylogenetics stage 1: undergraduate preparation at Harvard
As this paper is primarily about an applied use of phylogenetics (in selecting genomes for sequencing), I thought it would be worth going into some of how I personally became a bit obsessed with applied uses of phylogenetics. For me, my obsession began as an undergraduate at Harvard where I got exposed to the value of phylogeny as a tool from many many angles including but not limited to:

  • Freshman year taking a course taught by Stephen Jay Gould where Wayne and David Maddison were Teaching Assistants and where they were demoing their new phylogenetics software called MacClade
  • Sophomore year taking a conservation biology class with Eric Fajer and Scott Melvin where I was exposed to the concept of “phylogenetic diversity” as a tool in assessing conservation plans
  • Junior year working in the lab of Fakhri Bazzaz with people like David Ackerly and Peter Wayne who made use of phylogeny as a key tool in their research projects
  • Senior year and the year after graduating where I worked in the lab of Colleen Cavanaugh using rRNA based phylogenetic analysis to characterize uncultured chemosynthetic symbionts. I note it was in Colleen’s lab that I also became obsessed you could say with microbes and why they rock.
My personal interest in applied uses of phylogenetics stage 2: graduate school at Stanford
All of this and more gave me a strong passion for phylogeny as a tool. And so I went to graduate school at Stanford (originally to work with Ward Watt on butterflies, but then I switched to working in Phil Hanawalt’s lab on the “Evolution of DNA repair genes, proteins and processes”). And while in that lab I became pretty much obsessed with three things, all related to phylogeny.
First, I was interested in whether the rRNA tree of life, which I had used in my studies in Colleen Cavanaugh’s lab (and in my first paper in J. Bacteriology, which, thanks to ASM, is now in Pubmed Central and free at ASM’s site too), was robust or, as some critics argued, was not that useful. This was a critical question since the best way to study the phylogeny of microbes at the time, and also the best way to study uncultured microbes, was to leverage the ability to clone rRNA genes by PCR and then to build evolutionary trees of those rRNA genes. As part of my graduate work, I did a study where I compared the phylogenetic trees of rRNA to trees of another gene from the same species (I chose recA). Surprisingly, despite the claims that the rRNA tree was not very useful and that different genes always gave different trees, if you compared the two trees ONLY where there was strong support for a particular branching pattern, the trees of the two genes were in fact VERY VERY similar (a finding that had been suggested previously by others, including Lloyd and Sharp).
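For readers who like to see the idea in code, here is a rough sketch of what “compare the two trees only where there is strong support” can mean. It is not the analysis from the actual study: the taxa, the support values and the 0.9 cutoff are all made up for illustration. Each tree is reduced to its branches, written as splits of the taxa, and two splits conflict only if no tree could contain both.

```python
from itertools import product


def conflicts(split_a, split_b, taxa):
    """Two splits conflict when all four pairwise intersections of
    their sides are non-empty (they cannot coexist in one tree)."""
    sides_a = (split_a, taxa - split_a)
    sides_b = (split_b, taxa - split_b)
    return all(x & y for x, y in product(sides_a, sides_b))


def strongly_supported_conflicts(splits_1, splits_2, taxa, min_support=0.9):
    """List conflicting pairs of splits, ignoring weakly supported branches.

    splits_1 / splits_2 map a frozenset of taxa (one side of a branch)
    to a support value between 0 and 1 (e.g., a bootstrap proportion).
    """
    strong_1 = [s for s, sup in splits_1.items() if sup >= min_support]
    strong_2 = [s for s, sup in splits_2.items() if sup >= min_support]
    return [(a, b) for a in strong_1 for b in strong_2 if conflicts(a, b, taxa)]


# Hypothetical example: five taxa, two gene trees.
taxa = frozenset("ABCDE")
rrna_tree = {frozenset("AB"): 0.98, frozenset("ABC"): 0.55}
reca_tree = {frozenset("AB"): 0.95, frozenset("AD"): 0.40}

print(len(strongly_supported_conflicts(rrna_tree, reca_tree, taxa, 0.0)))
print(len(strongly_supported_conflicts(rrna_tree, reca_tree, taxa, 0.9)))
```

In this toy data the two trees disagree when every branch is counted, but the disagreements all involve weakly supported branches, so they disappear once the cutoff is applied.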
Second, I also became obsessed with the fact that most of the experimental studies of DNA repair processes were in a very narrow sampling of the phylogenetic diversity of organisms (e.g., at the time, no studies had been done in archaea, and most studies in bacteria were from only two of the many major groups). So I started experimental studies of repair in halophilic archaea in order to help broaden the diversity of studies. And I began to use PCR to try and clone out repair genes from various other species of diverse bacteria and archaea. Alas, as I was doing this, some institute called TIGR was sequencing the complete genomes of organisms I was trying to clone out single genes from. In fact, one of the first organisms I was working on for PCR studies was Archaeoglobus fulgidus. And when I found out TIGR was sequencing the genome, in a project led by none other than the great microbial evolutionary biologist Hans-Peter Klenk (yes, the same one who helped us in this GEBA project), I decided it was silly to try to clone out individual genes by PCR. And thus I began to learn how to analyze genomes.
It was in the course of learning how to analyze genomes that I came up with another applied use of phylogeny. I realized that one should be able to use phylogenetic studies of genes to help in predicting functions for uncharacterized genes as part of genome annotation efforts. And so I wrote a series of papers showing that this in fact worked (I did this first for the SNF2 family of proteins and then alas coined my own omics word “phylogenomics” to describe this integration of genome analysis and phylogenetics and formalized this phylogenomic approach to functional prediction). I note that what I was arguing for was that protein function could be treated like ANY other character trait and one could use character trait reconstruction methods (which I had learned about while playing with that MacClade program) to infer protein functions for unknown proteins in a protein tree. I note that this notion of predicting protein function from a protein tree is completely analogous to (and one could rightfully say stolen from) how researchers studying uncultured microbes were trying to infer properties of microbes from the position of their rRNA genes in the rRNA tree of life (as I had learned in studies of symbioses).
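Here is a small sketch of the “treat function like any other character” idea, using Fitch-style parsimony on a protein tree. To be clear, this is a simplified illustration and not the method from those papers: the tree, the protein names and the function labels are hypothetical, the tree is assumed to be strictly binary, and ambiguity is kept as a set of candidate functions rather than resolved.

```python
def up(node, known):
    """Bottom-up pass. Returns (candidate_functions_or_None, payload),
    where payload is a leaf name or a pair of annotated children.
    None means 'no information below this node'."""
    if isinstance(node, str):
        return ({known[node]} if node in known else None, node)
    left, right = up(node[0], known), up(node[1], known)
    a, b = left[0], right[0]
    if a is None or b is None:
        states = a if b is None else b
    else:
        # Fitch rule: intersection if non-empty, otherwise union.
        states = (a & b) or (a | b)
    return (states, (left, right))


def down(annotated, parent_states, out):
    """Top-down pass: narrow each node using its parent's candidates."""
    states, payload = annotated
    if states is None:
        chosen = parent_states
    elif parent_states is not None and (states & parent_states):
        chosen = states & parent_states
    else:
        chosen = states
    if isinstance(payload, str):
        out[payload] = chosen
    else:
        for child in payload:
            down(child, chosen, out)


def predict_functions(tree, known):
    """Map every leaf to its set of candidate functions (or None)."""
    out = {}
    down(up(tree, known), None, out)
    return out


# Hypothetical protein tree: p2 is uncharacterized.
tree = ((("p1", "p2"), "p3"), ("p4", "p5"))
known = {"p1": "repair", "p3": "repair",
         "p4": "transcription", "p5": "transcription"}
print(predict_functions(tree, known)["p2"])
```

The uncharacterized protein p2 sits inside a clade whose characterized members all share one function, so it inherits that function, even though the tree as a whole contains proteins with a different one.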
My personal interest in applied uses of phylogenetics stage 3: my plans for a post doc
So as I was wrapping up graduate school I was seeking a way to go beyond what I was doing and combine studies of DNA repair and evolution and microbiology in another way. And I thought I had found a perfect one in a post doc I accepted with A. John Clark at U. C. Berkeley. John was the person who had discovered recA, the gene I had been using for phylogenetic analysis and for structure function studies. And he was working with none other than Norm Pace and a young hotshot in Norm’s lab, Phil Hugenholtz (as well as a few others including Steve Sandler) in trying to use the recA homolog in archaea as a marker for environmental studies of archaea. It sounded literally perfect. And so I was excited to start this job. That was, until I met Craig Venter.
Grabbing the TIGR by the tail
While I had been playing around with data from TIGR in the latter years of my time in graduate school, I also got involved in teaching a fascinating class with David Botstein, Rick Myers, David Cox and others. (As an aside, this class was part of a new initiative I helped design at Stanford on “Science, Math and Engineering” for non science majors – an initiative that was a pet project of none other than Condie Rice who was Provost at the time). Anyway, Rick Myers was serving as a host for one Craig Venter when he came and gave a talk at Stanford and somehow I managed to finagle my way into being invited to go out to dinner with Craig. And at dinner, I proceeded to tell Craig that I thought some of the evolution stuff he was talking about was bogus and I tried to explain some of my work on phylogeny and phylogenomics. Not sure what Craig thought of the cocky PhD student drawing evolutionary trees on napkins, but it eventually got me a faculty job at TIGR and I worked extensively with Craig so it must have been worth something. And so I and my fiancée Maria-Inés Benito (now wife …) moved to Maryland and spent eight great years there (my wife started in MD as a faculty member at TIGR too, but then she left to go to a company called Informax, may it rest in peace).

Most of my work at TIGR focused on a different side of phylogenomics than represented in the GEBA project. At TIGR I focused on the uses of evolutionary analysis as a component to analyzing genomes – from predicting gene function to finding duplications (e.g., see the V. cholerae genome paper) to identifying genes under unusual patterns of mutation or selection to finding organelle derived genes in nuclear genomes (e.g., see this) to studying the occurrence of lateral gene transfer or the lack of occurrence of it to studying genome rearrangement processes. And sure, every once in a while I worked on a project where the organism was the first in its major branch to have a genome sequenced (e.g., Chlorobi). And I had noted, along with others, that there was a big phylogenetic bias in genome sequencing projects (e.g., see my 2000 review paper discussing this here).

But that did not really drive my thinking about what genome to actually sequence until TIGR hired a brilliant microbial systematics expert Naomi Ward as a new faculty member. And it was Naomi who kept emphasizing that someone should go about targeting the “undersequenced” groups in the Tree of Life.

NSF Assembling the Tree of Life grant
And so Naomi and I (w/ Karen Nelson and Frank Robb) put together a grant for the NSF’s “Assembling the Tree of Life” program to do just this – to sequence the first genomes from eight phyla of bacteria for which no genomes were available but for which there were cultured organisms. Amazingly we got the grant. And we did some pretty cool things on that project, including sequencing some interesting genomes, and developing some useful new tools for analyzing genomes (e.g., STAP, AMPHORA, APIS). And I was able to hire some amazing scientists to work in my lab on the project including Dongying Wu (the lead author on the GEBA paper) and Martin Wu (who also worked on the GEBA project and is now a Prof. at U. Virginia) and Jonathan Badger. Alas, we did not publish any earth shattering papers as part of this NSF Tree of Life project on analyzing the genomes of these eight organisms, not because there was not some interesting stuff there but for some other reasons. First, I moved to UC Davis and there was a complicated administrative nightmare in transferring money and getting things up and running at Davis on this project so my lab was not really able to work on it for two years (in retrospect, what a f*ING nightmare dealing with the UC Davis grants system was …).

Then, just as things were ready to get restarted, TIGR kind of imploded and many of the people, including Naomi, my CoPI, left (though I note, my moving to Davis was unrelated to the dissolution of TIGR). But perhaps most importantly, there were some actual technical and scientific problems with our dreams of changing the world of microbiology from our phyla sampling project – the science was not quite there. In particular, having a single genome from each of these phyla was simply not enough to get (and show) the benefits that can come from improved sampling of the tree of life. And thus though we have published some cool papers from this project (e.g., see this PLoS One paper on one of the genomes), we all ended up, in one way or another, disappointed with the final results.

Davis and JGI: the return of phylogeny to genomic sampling
When I moved to UC Davis I also was offered (and accepted) an Adjunct Appointment at the Joint Genome Institute (JGI). At both places, I envisioned reinventing myself as someone who worked on studying microbes directly in the environment (e.g., with metagenomics) and symbioses (both of which I had started on at TIGR). And in fact, in a way, I have done this, since I got some medium to big grants to work on these issues. I tried diligently to attend weekly meetings at the JGI but it was difficult since technically I was 100% time at UC Davis and was in essence supposed to be at 0% time at JGI. And when JGI hired Phil Hugenholtz to run their environmental genomics/metagenomics work, I was needed less at JGI since, well, Phil was so good. It was great to go over there and interact with Eddy Rubin, Phil Hugenholtz, and Nikos Kyrpides, among others, but it was unclear what exactly I would do there with Phil running the metagenomics show.

And then, like magic, something came up. I went to one of the monthly senior staff meetings at JGI and while we were listening to someone on the speaker phone, Eddy Rubin handed me a note asking me what I thought about the proposal someone was making to sequence all the species in the Bergey’s Manual. And the light bulb of phylogeny went back on in my head. I told him (I think I wrote it down, but may have said out loud), something like “well, sequencing all 6000 or so species would be great, but it would be better to focus on the most phylogenetically novel ones first.” And in a way, GEBA was born. Eddy organized some meetings at JGI to discuss the Bergey’s proposal and I argued for a more phylogeny driven approach. And this is where having Phil Hugenholtz and Nikos Kyrpides at JGI was like a perfect storm. You see, both had been lamenting the limited phylogenetic coverage of genomes for years, just like I had. Phil had even written a paper about it in 2002 which we used as part of our NSF Tree of Life proposal. And Nikos too had been diligently working for years to make sure novel organisms were sequenced. So though we went to a meeting to discuss the Bergey’s manual idea, we instead proposed an alternative – GEBA.

And for some months, we pitched this notion to various people including at JGI, DOE, and various advisory boards. And the response was basically – “OK – sounds like it COULD be worth doing – why don’t you do a pilot and TEST if it is worth doing.” And so, with support from Eddy Rubin and DOE, that is what we did.

One key limitation – getting DNA

So Phil, Nikos and I and a variety of others started working on the general plan behind GEBA. But there was one key limitation. How were we going to get DNA from all these organisms? One possibility was to seek out diverse people in the community and have them somehow help us. This had some serious problems associated with it, not the least of which was the worry that we might end up sequencing varieties of organisms that people had in their lab but which nobody else had access to (something Naomi Ward and I had written about as a problem a few years before).

And here came the second perfect storm – none other than Hans-Peter Klenk (yes, the same one who had led some of the early genome sequencing efforts when he was at TIGR), was visiting JGI. And he had a relatively new job – at the German Culture Collection DSMZ (In fact, I should note, I had tried to get a job at TIGR even before I met Venter, since they had a position advertised for a “microbial evolutionary biologist” — but that job went to Klenk). Phil Hugenholtz had asked the Head of DSMZ, Erko Stackebrandt, if they might be interested in helping us grow strains and get DNA but we did not yet have a full collaboration with them. And Erko had suggested we contact Hans-Peter. And in his visit to JGI it became apparent that he would do whatever he could to help us build a collaboration with DSMZ. And thus we had a source of DNA. Even more amazingly to me, they did it all for free.

GEBA begins

And thus began the real work in the project. Phil used his expertise with rRNA databases, especially GreenGenes, to pull out phylogenetic trees of different groups. And Nikos used his expertise as the curator of a database on microbial sequencing projects (called GenomesOnline) to help tag which branches in Phil’s tree had sequenced genomes or ones in progress. And then they looked for whether any of the members of the unsequenced branches had representatives in the DSMZ collection. And with some help from Dongying Wu and me, we came up with a list. And with the help of the JGI “Project Management” team including David Bruce and Lynne Goodwin and Eileen Dalin and others at JGI we developed a protocol for collaborating with DSMZ and getting DNA from them.
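The selection logic described above can be sketched in a few lines. This is only an illustration under loud assumptions: it is not the actual GEBA pipeline, it does not touch GreenGenes, GenomesOnline or the DSMZ catalog, the tree and strain names are invented, and picking the first available strain in each clade is an arbitrary choice.

```python
def leaves(node):
    """Return the leaf names under a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def unsequenced_clades(node, sequenced):
    """Yield the largest clades (as leaf lists) with no sequenced genome."""
    members = leaves(node)
    if not any(m in sequenced for m in members):
        yield members
    elif not isinstance(node, str):
        for child in node:
            yield from unsequenced_clades(child, sequenced)


def pick_targets(tree, sequenced, in_collection):
    """Pick one culturable representative per unsequenced clade."""
    targets = []
    for clade in unsequenced_clades(tree, sequenced):
        available = [m for m in clade if m in in_collection]
        if available:
            targets.append(available[0])  # arbitrary: first one listed
    return targets


# Hypothetical rRNA tree with seven organisms.
tree = ((("A", "B"), ("C", "D")), ("E", ("F", "G")))
sequenced = {"A", "E"}            # genomes finished or in progress
in_collection = {"B", "C", "G"}   # strains available from a culture collection

print(pick_targets(tree, sequenced, in_collection))
```

A real version would also want to rank the clades, for example by how much branch length each one adds to the tree, so that the most phylogenetically novel organisms get sequenced first.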

And I became the chief cheerleader and administrator of the project, in part since Phil and Nikos were so busy with their other things at JGI. And though I was not always on the ball, the project moved forward and we started to get genomes sequenced using the full strength of the JGI as a genome center. The finishing teams at JGI worked diligently on finishing as many of the genomes as possible. And Nikos’ team at JGI made sure that the genomes were annotated. And I helped make sure that the data release policies were broadly open (which everyone at JGI supported). And after many false starts with papers on the project that were way way way too cumbersome and big, with some kicks in the pants from the director of JGI Eddy Rubin who was getting anxious about the project, we turned out the GEBA paper that was published today in Nature.

You might ask, why, as a PLoS official and PLoS cheerleader, we ended up having a paper in Nature? Well, in the end, though I am senior author on the paper, the total contribution to the work mostly came from people at JGI who did not work for me but instead worked with me on this great project. And we took some votes and had some discussions and in the end, despite my lobbying to send it to PLoS Biology, submitting it to Nature was the group decision. I supported this decision in part due to the fact that Nature uses a Creative Commons license for genome papers. But I also supported it because in the end, this was a collaboration involving many many many people and in such projects everyone has to compromise here and there. Now mind you, I am not sad to have a paper in Nature. But I would personally have preferred to have it in a journal that was fully open access, not just occasionally open like Nature.

Now I note, there were a million other things that went on associated with the GEBA project. Some of which I was not even involved in in any way. I will try to write about some of these another time, but this post is already way way way too long. So I am going to just stop here and add that I have been honored and lucky to work with people like Phil, Nikos, Hans-Peter, and others on this project and to have the people at the JGI work so hard on the background parts of this project. Thanks to all of them and to the people at DSMZ and in my lab who helped out and to the DOE for funding this work (as well as the Gordon and Betty Moore Foundation, who funded some of the work from my lab on analysis of these genomes). And last but not least, thanks to the Director of JGI Eddy Rubin, for supporting this project and for being patient with it and for kicking us in the pants when we needed to get moving on getting a paper out.

Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N., Kunin, V., Goodwin, L., Wu, M., Tindall, B., Hooper, S., Pati, A., Lykidis, A., Spring, S., Anderson, I., D’haeseleer, P., Zemla, A., Singer, M., Lapidus, A., Nolan, M., Copeland, A., Han, C., Chen, F., Cheng, J., Lucas, S., Kerfeld, C., Lang, E., Gronow, S., Chain, P., Bruce, D., Rubin, E., Kyrpides, N., Klenk, H., & Eisen, J. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea Nature, 462 (7276), 1056-1060 DOI: 10.1038/nature08656