Calling all authors, bloggers, reporters – please help with aggregating discussions of scientific papers

There has been a lot of hand wringing over whether, and if so how, we should link discussions about scientific papers with the papers themselves.  For example, if someone writes a news story about a paper in BMC Genomics – should the online version of the paper show links to the news story?  I think so, as do many others.  If someone writes a blog post discussing the paper, should there be a tracked link on the journal site?  Again, I think so.  I think this is even more important for social media discussions of papers, which I find fascinating much of the time.  Very few people go to journal sites and post comments and open up discussions of papers.  But lots of people post comments to Twitter, Facebook, and other social media sites.  So why not bring those posts into the fold?

Now there are lots of efforts out there to collect comments about papers.  Faculty of 1000.  The Third Reviewer.  Research blogging. And much more. For other discussions of the issue see:

I am not really going to get into a discussion of all the ideas out there in this area.  Some are good.  Some are bad.  Some are probably both.  I personally think aggregation is going to be a very useful tool in post publication sharing and discussion and searching.  But that is not per se what I am here to talk about.  What I am here to talk about is what anyone can do right now to help with this in a very simple way.

My simple suggestion:
If you see ANY online discussion of a paper – a news story, a blog post, even some smaller thread somewhere – find the journal article online and use the comments function, if the journal has one, to post a comment saying “There was a news story discussing this paper in the New York Times.  See ….”  Or something like that.  And presto, people who go see the paper online will also have the potential to find the link you post.

I have been doing this for a while.  It is relatively easy for PLoS Papers.  For example for my paper on “Stalking the Fourth Domain in Metagenomic Data” I posted all sorts of links using the PLoS One comment function.  I posted links to my blog.  I posted links to positive news stories.  I posted links to critiques.  Anything I could find.

And this worked out pretty well.

I then started posting links for other papers, even pretty old ones (I just posted a few for my PLoS Biology paper in 2006 on the Tetrahymena genome).  I have now done this for many PLoS papers as well as my recent Nature paper on a “Phylogeny driven genomic encyclopedia of bacteria and archaea”.  Now, mind you, this works best when the papers are open access or at least freely available, so that people can read the paper as well as the discussions.  But you could do this for any paper in principle if the journal has a commenting function.

Now – I am not alone in starting to do this.  PLoS One has even launched this as a formal “media tracking” project: see PLoS ONE’s Media Tracking Project | EveryONE.  Not sure how well their system will work, but any aggregation is good.  Of course, in the long run, systems that aggregate automatically using trackbacks or DOIs or other tools will be best (e.g., some journals link to Research Blogging posts but not all do), but those do not always work perfectly and some journals do not seem to like the automated approaches.  So please – link and comment away.  Become part of the aggregation solution.  I know this is not all we need to do and this is a relatively minor thing – but if we get everyone engaged in doing it, I believe there will be a catalytic effect whereby people will then understand why this might be useful to do broadly …

More on ‘phylogenomics’ – as in functional prediction w/ phylogeny

There is a new paper out: Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium in Briefings in Bioinformatics.

The paper is interesting and presents a new general approach to using phylogeny for functional prediction of uncharacterized genes. I am interested in this for many reasons, including that I was one of the first, if not the first, to lay this out as a concept.  In a series of papers from 1995-1998 I outlined how phylogenetic analysis could be used to aid in functional prediction for all the genes that were starting to be sequenced in genome projects without any associated functional studies (at the time, I referred to all these ESTs and other sequences as an “onslaught” – little did I know what was to come).

My first paper on this topic was in 1995: Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions.  The abstract is below:

The SNF2 family of proteins includes representatives from a variety of species with roles in cellular processes such as transcriptional regulation (e.g. MOT1, SNF2 and BRM), maintenance of chromosome stability during mitosis (e.g. lodestar) and various aspects of processing of DNA damage, including nucleotide excision repair (e.g. RAD16 and ERCC6), recombinational pathways (e.g. RAD54) and post-replication daughter strand gap repair (e.g. RAD5). This family also includes many proteins with no known function. To better characterize this family of proteins we have used molecular phylogenetic techniques to infer evolutionary relationships among the family members. We have divided the SNF2 family into multiple subfamilies, each of which represents what we propose to be a functionally and evolutionarily distinct group. We have then used the subfamily structure to predict the functions of some of the uncharacterized proteins in the SNF2 family. We discuss possible implications of this evolutionary analysis on the general properties and evolution of the SNF2 family.



I note – I am annoyed that when I went to the Nucleic Acids Research site for my paper I discovered that for some bizarre reason they are now trying to charge for access to it even though it is in Pubmed Central and used to be freely available on the NAR site.  WTF?  Is this just an IT issue like the #OpenGate complaints I made for a while about Nature Genome papers?

Anyway – in that paper in 1995 I basically showed that at least for this family, phylogenetic analysis could be used as a tool in making functional predictions by allowing one to better identify orthology relationships and subfamilies within the SNF2 superfamily.  This was, I think, at least a little bit novel, but others at the time were also looking into using various analyses to identify orthology relationships across genomes.

Shortly thereafter I started working on the concept that one could use the phylogenetic tree more explicitly in making functional predictions and eventually I laid out the concept of treating function as a character state and doing character state reconstruction using a gene tree to then infer functions for uncharacterized genes.  I called this approach “phylogenomics” in a paper in 1997 in Nature Medicine (the editor asked us to give it a name … and thus my own contribution to the omics word game began).  Alas somehow the title of our paper became “Gastrogenomic delights: a movable feast” since we were writing about the E. coli and H. pylori genomes, so I added yet another omics term at the same time.  In the paper I showed how phylogenetic analysis of the MutS family of proteins could help in interpreting one of the findings in the H. pylori genome paper:

In this paper we showed why BLAST searches were not ideal for inferring relationships among sequences (because BLAST measures similarity, NOT evolutionary history per se).  I am still a bit annoyed that other papers then sort of claimed they were the first to show BLAST was not ideal for inferring evolutionary relatedness, but whatever. This still did not fully describe the phylogeny driven approach that I was working on, so I then wrote up an outline of this approach for a paper in Genome Research: Phylogenomics: Improving Functional Prediction for Uncharacterized Genes by Evolutionary Analysis.  This paper really laid out the idea in more detail:

It also gave detailed examples of how similarity searches could be misleading and how phylogenetic analysis should in principle be better.
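
To make the "function as a character state" idea a bit more concrete, here is a minimal sketch in Python. It is NOT the method from any of the papers above. It simply uses Fitch parsimony (one standard way of doing character state reconstruction) on a made-up gene tree, and every gene name and function label in it is hypothetical.

```python
# Toy gene tree as nested tuples; leaves are gene names (all hypothetical).
TREE = ((("geneA", "geneB"), "geneX"), ("geneC", "geneD"))

# Known functions (the character states); geneX is uncharacterized.
KNOWN = {
    "geneA": "repair",
    "geneB": "repair",
    "geneC": "transcription",
    "geneD": "transcription",
}


def fitch_up(node, known, sets):
    """Bottom-up pass: compute the candidate state set for each node."""
    if isinstance(node, str):
        # An uncharacterized leaf gets None, meaning "any state allowed".
        s = {known[node]} if node in known else None
    else:
        left = fitch_up(node[0], known, sets)
        right = fitch_up(node[1], known, sets)
        if left is None:
            s = right
        elif right is None:
            s = left
        else:
            # Fitch rule: intersection if non-empty, otherwise union.
            s = (left & right) or (left | right)
    sets[node] = s
    return s


def fitch_down(node, parent_state, sets, out):
    """Top-down pass: pick one state per node, preferring the parent's state."""
    s = sets[node]
    if s is None or parent_state in s:
        state = parent_state
    else:
        state = sorted(s)[0]  # arbitrary but deterministic tie-break
    if isinstance(node, str):
        out[node] = state
    else:
        for child in node:
            fitch_down(child, state, sets, out)


def predict_functions(tree, known):
    """Return inferred functions for the leaves that have no known function."""
    sets, out = {}, {}
    fitch_up(tree, known, sets)
    fitch_down(tree, None, sets, out)
    return {gene: func for gene, func in out.items() if gene not in known}


predictions = predict_functions(TREE, KNOWN)
print(predictions)
```

In this toy tree geneX sits next to the two "repair" genes, so it inherits "repair" from the reconstructed ancestor rather than from whichever sequence happens to be most similar. Real analyses would of course need a real tree, branch lengths, and some handling of duplication events.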

I note – I am very very proud of this paper.  But it did not do a lot of things.  Really it was about laying out a concept of using tools from phylogenetics in functional prediction.  But it did not provide software for example.  I later developed some of my own scripts for doing this when I was at TIGR but really the software for phylogeny driven functional predictions would come later from others like Kimmen Sjolander, Sean Eddy, and Steven Brenner.  Each method laid out in these tools and in other papers had its own flavors and I continued to explore various approaches and applications to phylogeny driven functional prediction.  Examples of my subsequent work are listed below (with links to the Mendeley pages for these papers):

Plus we (at TIGR) used phylogenetic analysis as a tool in annotation of many many genomes as well as metagenomes.

Anyway, enough of history for a bit.  What is interesting about this new paper is that they take a slightly different approach to phylogeny driven functional prediction in that they make use of Gene Ontology functional annotations as their key parameter to trace on evolutionary trees.  They lay out the differences in their method quite well in the introduction:

Our general approach is similar to the ‘phylogenomic’ method proposed by Eisen [6] and further developed into a probabilistic form by Engelhardt et al. [7], but with important differences. Eisen proposed a conceptual approach for predicting protein function using a phylogenetic tree together with available experimental knowledge of proteins. The original approach relied on manual curation to identify gene duplication events and to find and assimilate the literature for characterized members of the family. Engelhardt et al. used automated reconciliation with the species tree [8] to identify gene duplication events, and experimental GO terms (MF only) to capture the experimental literature. Using this information, they defined a probabilistic model of evolution of MF involving transitions between different molecular functions.

From these previous studies, we adopt the basic approach of function evolution through a phylogenetic tree and the use of GO annotations to represent function. However, unlike these other phylogenomic methods, we represent the evolution in terms of discrete gain and loss events. In Eisen’s original model, an annotation does not necessarily represent a gain of function (it could have been inherited from an earlier ancestor), and losses are not explicitly annotated. The transition-based model of Engelhardt et al. assumes replacement of one function by another (gain of one function coupled to the loss of another), and does not capture uncoupled events, which is particularly important for BP annotations and cases where a protein has multiple molecular functions (see examples below). In addition, we make no a priori assumptions about conservation of function within versus between orthologous groups, or about the relationship between evolutionary distance and functional conservation (as the distance may not necessarily reflect every given function). While, as described below, gene duplication events and relatively long tree branches are important clues for curators to locate functional divergence (gain and/or loss), in our paradigm an ancestral function can be inherited by both descendants following a duplication (resulting in paralogs with the same function) or gained/lost by one descendant following a speciation event (resulting in orthologs with different functions). Evolution of each function is evaluated on a case-by-case basis, using many different sources of information about a given protein family

I note – Paul Thomas, one of the authors here, has also been developing phylogeny driven functional prediction methods for many years and has done some cool things previously.  This new approach seems novel and useful and their paper is worth looking at.  I like too that they focus on MutS homologs for some of their examples:

Anyway – their paper is worth a read and some of their software tools may be of use including PAINT: http://sourceforge.net/projects/pantherdb/ and http://pantree.org

Good to see continuous developments in phylogeny driven functional predictions.  If you want to learn more – check out the Mendeley Group I have created:

http://www.mendeley.com/groups/1190191/_/widget/29/5/

And please contribute to it. Below are some previous posts of mine of possible interest:

Some links on "ortholog conjecture" paper and critiques of it

Recently a paper by Matt Hahn was published in PLoS Computational Biology entitled “Testing the ortholog conjecture with comparative functional genomic data from mammals.”  The paper created a bit of a stir as some aspects of it call into question some of the standard assumptions made in comparative genomic analysis.

I alas do not have time to go into all the details but fortunately others have tackled this and I am posting some links here:

http://friendfeed.com/erickmatsen/f90bd2c6/emergentnexus-i-think-what-you-were-talking?embed=1

Will try to post my own comments soon.  I note – I am skeptical of their conclusions but still going through the paper to understand everything before commenting in more detail.

Playing around with CloVR – cloud computing bioinformatics system

Nice new tool/resource available out there for metagenomic and genomic analysis called CloVR: CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing
It is available at http://clovr.org and it should be useful to many people out there doing genomics and metagenomics if you want to make use of cloud computing resources.

CloVR is brought to us by Florian Fricke and Owen White and Sam Angiuoli and others from the University of Maryland (full disclosure – many of the authors are ex-colleagues of mine from TIGR).

Not only is CloVR available openly and freely but they even have a CloVR blog: http://clovr.org/category/blog/ … though it does not seem to be heavily used.  Kudos to this team for producing and releasing this software for others to use.  And kudos to NSF, USDA and NIH for funding its development – I have a feeling many people will use it.

I think that I shall never see – metagenomic analysis as lovely as a tree #PhylogenyRules #PLoSOne


Figure 2. Phylogenetic tree linking metagenomic sequences from 31 gene families along an oceanic depth gradient at the HOT ALOHA site

I am a co-author on a new paper that came out in PLoS One yesterday.  The paper is PLoS ONE: The Phylogenetic Diversity of Metagenomes and the full citation is Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214.

The first author is Steven Kembel, a brilliant post doc at the University of Oregon.  You can follow him on twitter here. This paper is a product of the “iSEEM” “integrating statistical, ecological and evolutionary approaches to metagenomics” collaboration between my lab and the labs of Jessica Green at U. Oregon and Katie Pollard at UCSF.  For more on iSEEM see http://iseem.org.  iSEEM was supported by the Gordon and Betty Moore Foundation.

Anyway – the paper focuses on developing and using a new method for assessing the phylogenetic diversity of microbes in samples via analysis of metagenomic data.  Phylogenetic diversity (aka PD) is measured by building evolutionary trees and summing up the total length of branches in such trees.  It is an important diversity metric and is complementary to metrics such as “species richness”, which is a measure of the number of species in a sample. When one counts species in a sample, one ends up ignoring the evolutionary distances between species and thus one may get an incomplete picture of the diversity of organisms in a sample simply by counting species.  For example, a sample that contains 500 different species in the genus Escherichia would have the same “richness” as a sample that contained one representative of each of 500 different Orders of bacteria.  For many purposes it is useful to know whether one has a phylogenetically diverse sample or not.  (And of course, if one just focuses on species richness it is also important to not simply ignore some set of organisms in the samples, as has sort of been done in a recent paper estimating the total species richness on the planet).  But that is not the point here – the point here is that counting species, even if done correctly, can give an incomplete picture of the diversity of organisms in a sample.
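
A tiny worked example may help here. The sketch below is only an illustration, not the code from the paper: the tree, the taxon names and the branch lengths are all invented, and PD is computed in one common way (summing the branches on the paths from each sampled taxon back to the root, counting each branch once).

```python
# child -> (parent, branch length). All names and lengths are invented.
PARENT = {
    "esch1": ("escherichia", 0.01),
    "esch2": ("escherichia", 0.01),
    "esch3": ("escherichia", 0.02),
    "escherichia": ("root", 0.50),
    "bacillus1": ("root", 0.60),
    "cyano1": ("root", 0.70),
}


def richness(taxa):
    """Species richness: just the number of distinct taxa in the sample."""
    return len(set(taxa))


def phylogenetic_diversity(taxa, parent=PARENT):
    """Sum the lengths of all branches linking the sampled taxa to the root."""
    branches = set()
    for taxon in taxa:
        node = taxon
        while node in parent:
            branches.add(node)  # a branch is identified by its child node
            node = parent[node][0]
    return sum(parent[node][1] for node in branches)


close_sample = ["esch1", "esch2", "esch3"]        # three close relatives
distant_sample = ["esch1", "bacillus1", "cyano1"]  # three distant lineages

for name, sample in [("close", close_sample), ("distant", distant_sample)]:
    print(name, richness(sample), round(phylogenetic_diversity(sample), 2))
```

Both samples have a richness of 3, but the sample of distant lineages covers far more branch length (1.81 versus 0.54 in these made-up units), which is exactly the difference that counting species misses.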

For many years researchers have been attempting to measure phylogenetic diversity of various organisms in various samples.  And to do this one needs an evolutionary tree of the organisms in order to then measure branch length in the tree.  There is actually a relatively rich history of researchers attempting to look at PD in studies of microbes – especially in cases where one has access to a rRNA tree for the organisms / samples in question.  Examples of past work on this include:

What we wanted to do here was use metagenomic data to assess phylogenetic diversity of samples.  And in particular we wanted to do this with genes other than rRNA genes (e.g., protein coding genes).  There were multiple challenges in being able to do this (e.g., see a blog post I made about this issue a few years ago asking for community input).  Fortunately, Kembel has worked previously on multiple issues relating to phylogenetic diversity and phylogenetic ecology and his work led to this paper.

I note, as an aside, I have created a Mendeley group focusing on phylogenetic analysis of metagenomes and have added a diversity of papers to the collection:

http://www.mendeley.com/groups/1152921/_/widget/29/2/

In the paper Steve basically started with some of the notions and the code from AMPHORA which was designed by Martin Wu (when he was in my lab).  AMPHORA automatically infers phylogenetic trees of a set of 31 protein coding genes – and it can do this from genomic or metagenomic data. 
AMPHORA was designed to build phylogenetic trees of metagenomic sequences individually – in order to classify reads from samples to infer from what organism they likely came.
But that is not what Steven wanted to do here.  What he wanted to do was infer phylogenetic trees from metagenomic samples where ALL the organisms in the sample were included in the same tree.  This was / is challenging for many reasons and this is what I had written the blog post about previously.  One issue we had was the fact that sequences might not overlap with each other and thus including them in a single phylogenetic tree together was complicated.  
From my earlier post:
The challenge with this is really two things. First, we want to analyze just the reads themselves (i.e., we do not want to use assemblies you can make from this type of data). Second, and more importantly, we want to include in our analysis sequence reads that only cover small, not necessarily overlapping regions of the “full length” sequence alignments for the family. 

The alignment would look something like

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX
    where Xs are the regions covered by the sequences/fragments (could be DNA or amino acids)


We want to build trees from these alignments with the hope of using them to learn lots of cool things about the evolution of the fragments and the species from which they come. I can provide more information but really the key part for the phylogenetics here is the nature of the alignment.

In the past, I have decided to constrain my analyses to NOT deal with this type of alignment. I have either analyzed each fragment on its own or we have built a multiple alignment but only included fragments that cover more than 3/4 of the full length sequence and thus the matrix is much more filled out. Such an alignment would look like this

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX-------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 --XXXXXXXXXXXXXXXXXXXXXXXX--------
    fragment 3 -----XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 --XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-
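
The "only keep fragments covering more than 3/4 of the alignment" rule described above is easy to sketch. The snippet below is an illustration only and not the code we actually used; the function names are made up, the rows are toy data, and it assumes every row has already been padded to the same alignment width with `-` as the gap character.

```python
def coverage(row, gap="-"):
    """Fraction of alignment columns in which this row has a residue."""
    return sum(char != gap for char in row) / len(row)


def filter_fragments(alignment, min_coverage=0.75):
    """Keep only rows covering MORE than min_coverage of the alignment."""
    return {
        name: row
        for name, row in alignment.items()
        if coverage(row) > min_coverage
    }


# Toy alignment, 34 columns wide, in the style of the diagrams above.
alignment = {
    "sequence 1": "X" * 34,
    "fragment 1": "X" * 9 + "-" * 25,            # covers 9/34, dropped
    "fragment 4": "-" * 4 + "X" * 28 + "-" * 2,  # covers 28/34, kept
}

kept = filter_fragments(alignment)
print(list(kept))
```

The obvious cost of a filter like this is that most short reads get thrown away, which is the whole reason for asking how to do better.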

But we really want to include the smaller fragments in our analysis. And we are just not certain how to best do this. We know LOTs of people out there think of similar problems in terms of sparse matrices, supermatrices, supertrees, EST data, etc. And we have ideas about how to do this and are asking some phylogenetics gurus we know by email. But I thought it might be fun to have the discussion on a blog rather than by email.

So again, how might one best build phylogenetic trees from data that looks like this?

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX


And from these trees we want to place each fragment relative to (1) the full length sequences and (2) to each other if possible. We also, of course, want branch lengths to reflect some sort of amount of evolution and thus do not just want a cladogram.

So what Steven decided to do in the end was create a method that took all of the AMPHORA markers and concatenated them together into a single mega alignment and then built a reference tree of this mega alignment from available genomes.  Then he searched for matches to any of these genes in metagenomic data and built a tree for each sequence that placed it relative to the reference data.  
Figure 1. Conceptual overview of approach to infer phylogenetic relationships among sequences from metagenomic data sets.
This pipeline allowed him to place many sequences from metagenomic samples onto a single tree such as this one:

Phylogenetic tree linking metagenomic sequences from 31 gene families along an oceanic depth gradient at the HOT ALOHA site

And from that he could calculate PD for metagenomic samples.  We then used the PD calculations to compare and contrast PD with other information, in particular from the HOT ALOHA metagenomic data set of Ed Delong, Steve Karl and others.
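
For those wondering what "concatenated them together into a single mega alignment" means in practice, here is a rough sketch of just that one step. It is not Steve's pipeline code: the gene names, taxon names and sequences are hypothetical, and the assumption is that each marker has already been aligned on its own. A taxon (or read) that is missing a marker simply gets gaps for that block, which is how a short read matching a single gene can still sit in the same matrix as full genomes.

```python
def concatenate(alignments, gap="-"):
    """Join per-gene alignments into one alignment, padding missing genes.

    alignments: dict of gene -> dict of taxon -> aligned sequence, where all
    sequences for a given gene have the same length.
    """
    genes = sorted(alignments)
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    combined = {taxon: "" for taxon in taxa}
    for gene in genes:
        aln = alignments[gene]
        width = len(next(iter(aln.values())))
        for taxon in taxa:
            # A taxon lacking this gene contributes a block of gaps.
            combined[taxon] += aln.get(taxon, gap * width)
    return combined


# Hypothetical marker alignments: two reference genomes plus one short read.
MARKERS = {
    "geneA": {"genome1": "MKT", "genome2": "MRT"},
    "geneB": {"genome1": "AL", "read_17": "-L"},
}

mega = concatenate(MARKERS)
for taxon, seq in mega.items():
    print(f"{taxon:8s} {seq}")
```

Building the reference tree and placing each read onto it are separate steps that this sketch does not attempt; those rely on dedicated phylogenetics software.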

Figure 3. Taxonomic diversity and standardized phylogenetic diversity versus depth in environmental samples along an oceanic depth gradient at the HOT ALOHA site.

For more detail on what we did from there on – read the paper.  It is open access so all can see it / download it / play with it / whatever.  But rather than blather on and on as usual I thought I would email Steve some questions and then post his answers.  These are below:

1. Can you provide any background to how this work got started and why you ended up doing it?

This work got started as a collaboration between the Eisen, Green, and Pollard labs as part of the iSEEM project (“Integrating Statistical Evolutionary & Ecological Approaches to Metagenomics”), which was funded by the Moore Foundation to figure out ways to address ecological and evolutionary questions using metagenomic data. I had a background in using phylogenetic and evolutionary information to understand ecological communities, and one of the things I wanted to do at iSEEM was to try to think about ways that we could apply methods from ecophylogenetics or phylogenetic community ecology to metagenomic data sets. In conversations among the co-authors, we realized that if we could build phylogenetic hypotheses for organisms based on metagenomic data, we could apply a huge body of ecological and evolutionary theory and use these data sets to improve our understanding of microbial communities and their dynamics.

2. How did you end up working on microbes with your background in larger organisms?

The transition from working on macro-organisms to working on microbes actually wasn’t that big of a leap, since my research has generally been question driven rather than study-system or study-organism driven. My previous research involved using phylogenetic information to better understand community assembly in plants and animals. The increasing availability of phylogenetic information for entire communities of plants and animals drove the development of the field of ‘ecophylogenetics’, and it always seemed to me that microbes would be the ideal system for this type of approach due to the greater availability of sequence data and phylogenetic information for microbes. Also, the development of high-throughput sequencing methods meant that the size of microbial community data sets would quickly become really, really large… the prospect of working on data sets with hundreds of millions of observations was really exciting. As my first postdoc was wrapping up, I collaborated on a study looking at phylogenetic diversity of the rhizobacterial symbionts of plant roots that got me interested in microbial ecology. Right around that time I came across the opportunity to work on the iSEEM project, so it seemed like the perfect opportunity to try a new study system.

Having studied the community ecology of both micro- and macro-organisms, I find it interesting that the fields of microbial and non-microbial phylogenetic community ecology have been fairly insulated from one another until recently. For example, the two fields independently developed phylogenetic approaches to community ecology, each field having its own set of favored statistical methods and software packages, with almost no cross-citation, despite addressing very similar questions. In microbiology the emphasis on phylogenetic diversity measures seems to have been driven by the empirical difficulty of defining microbial ‘species’ and other taxonomic units that macro-organismal ecologists are comfortable with, as well as the availability of phylogenetic and sequence data for microbes. Conversely, for macroorganisms the field of ecophylogenetics was driven by a desire to apply a large body of theory on the links between ecological and evolutionary dynamics to empirical data sets, but was relatively data poor in terms of phylogenetic information about individual species.

3. What was the biggest challenge in this work?

For me the biggest challenge was convincing myself and others that we could infer anything about organismal phylogenies from metagenomic data.  People had built phylogenies for individual genes from metagenomic data sets, but there was a lot of skepticism about how and whether it would be possible to infer a phylogeny for multiple genes given the short, non-overlapping nature of metagenomic sequences. A post on your blog provided a lot of useful feedback. In the end this challenge was overcome both through the availability of software packages for placement of short sequences onto reference phylogenies, as well as simulation and bootstrap analyses to make sure that the results we were finding were robust.

4. Any additional things left out of the paper that you would like to mention here? Other acknowledgements?  Annoyances?

There were a number of people involved in the iSEEM project, including Samantha Riesenfeld and Aaron Darling, who did simulations that were very helpful in figuring out when and whether we could make inferences about phylogenetic relationships among metagenomic reads.

Our paper makes use of a large number of open-source software packages and I’d like to thank the people who made their code available for re-use in this way. In particular the short sequence placement methods implemented in packages like RAxML and pplacer made this study possible.

5. What (in general) are your current and future plans?

Right now I’m working at the Biology & the Built Environment Center on a number of projects studying the phylogenetic and functional diversity of microbes in indoor environments, trying to understand the interaction between architectural design and microbial diversity indoors, and the role indoor microbes play in human health and well being. I am still interested in plant biology, and I have an ongoing project looking at the diversity and function of microbial communities on plant leaves (the ‘phyllosphere’) in tropical and temperate forests.

Kembel, S., Eisen, J., Pollard, K., & Green, J. (2011). The Phylogenetic Diversity of Metagenomes PLoS ONE, 6 (8) DOI: 10.1371/journal.pone.0023214

Mendelspod interview with, well, me, discussing science, pranks, evolution and more

Here is an interview with me produced by Mendelspod. It was filmed in my office at UC Davis on Mendel’s birthday.
 


Stalking the Fourth Domain with Jonathan Eisen, Ph D from mendelspod on Vimeo.

Some links of relevance to this interview:

Guardians of microbial diversity: some follow up links re species counting

Just some follow up links regarding counting species which I wrote about recently:

Oh and I am working on some T-shirts if you want to advertise your microbial loyalty

Structures, structures, and structures: Structural genomics of infectious disease drug targets: the SSGCID

Cool new data set available for those studying infectious disease.  See the paper: Structural genomics of infectious disease drug targets: the SSGCID

This is a large scale effort to determine crystal structures of importance for infectious disease studies.  And there is a whole series of papers coming out in “Acta Crystallographica Section F Structural Biology and Crystallization Communications”.

Plus all the papers about these structures are freely available (though I note – they refer to these as “Open Access” but really they mean “Available at no charge”).

Examples of papers include:

And many more.  

Also see their press release here.