Announcing CAFA 2: The Second Critical Assessment of Protein Function Annotations

Just received this from Iddo Friedberg:

Friends and Colleagues,

We are pleased to announce the Second Critical Assessment of protein
Function Annotation (CAFA) challenge. In CAFA 2, we would like to
evaluate the performance of protein function prediction tools/methods
(in old and new scenarios) and also expand the challenge to include
prediction of human phenotypes associated with genes and gene
products. As last time, CAFA will be a part of the Automated
Function Prediction Special Interest Group (AFP-SIG) meeting that will
be held alongside the ISMB conference. AFP-SIG will be held as a
two-day meeting in July 2014 in Boston.

The targets and all information about the CAFA challenge are now
available at http://biofunctionprediction.org. The submission deadline
for predictions is January 15, 2014. The initial evaluation will be
done during the AFP-SIG meeting in Boston. Anyone in the world is
welcome to participate.

The mission of the Automated Function Prediction Special Interest
Group (AFP-SIG) is to bring together computational biologists who are
dealing with the important problem of gene and gene product function
prediction, to share ideas and create collaborations. We also aim to
facilitate interactions with experimental biologists and biocurators.
We hope that AFP-SIG serves an important role in stimulating research
in annotation of biological macromolecules, but also related fields.

About the CAFA experiment

The problem: There are far too many proteins for which the sequence is
known, but the function is not. The gap between what we know and what
we do not know is growing. A major challenge in the field of
bioinformatics is to predict the function of a protein from its
sequence (and all other data one can find). At the same time, how can
we judge how well these function prediction algorithms are performing
and whether we are making progress over time?

The solution: The Critical Assessment of protein Function Annotation
algorithms (CAFA) is an experiment designed to provide a large-scale
assessment of computational methods dedicated to predicting protein
function. We will evaluate methods in predicting the Gene Ontology
(GO) terms in the categories of Molecular Function, Biological
Process, and Cellular Component. In addition, predictors may use the
Human Phenotype Ontology (HPO) for the human dataset. A set of protein
sequences is provided by the organizers, and participants are expected
to submit their predictions by the submission deadline, January 15,
2014. The predictions will be evaluated during the Automated Function
Prediction (AFP) meeting, which has been approved as a Special
Interest Group (SIG) meeting, at the ISMB 2014 conference (Boston,
USA).

History: The first CAFA experiment was conducted in 2010-2011.
Twenty-three groups submitted fifty-four algorithms for assessment.
The results and most methods were published in Nature Methods and in a
special supplement in BMC Bioinformatics. CAFA 1 has brought together
a large group of computational predictors and, for the first time,
provided us with a clear picture of the state of this important field.
As with other critical assessment experiments, the aim of CAFA is to
improve protein function prediction by continuously challenging groups
to develop more accurate methods.

How to participate in CAFA 2?
1. Download target proteins, available August 27, 2013
2. Submit predictions on or before January 15, 2014
3. Join us at the AFP-SIG, July 11-12, 2014 in Boston for the eighth
protein function prediction meeting, to hear the CAFA 2 results, to
present your work, and to learn about the latest research in
computational protein function prediction

More details at: http://biofunctionprediction.org

Confirmed keynote speakers:
Fiona Brinkman, Simon Fraser University, Canada
Mark Gerstein, Yale University, USA

Looking forward to hearing from you!
The CAFA organizing Team: Predrag Radivojac, Michal Linial, Sean
Mooney and Iddo Friedberg

Contact: CAFA.2014@gmail.com

More on ‘phylogenomics’ – as in functional prediction w/ phylogeny

There is a new paper out: Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium in Briefings in Bioinformatics.

The paper is interesting and presents a new general approach to using phylogeny for functional prediction of uncharacterized genes. I am interested in this for many reasons, including that I was one of the first, if not the first, to lay this out as a concept. In a series of papers from 1995-1998 I outlined how phylogenetic analysis could be used to aid in functional prediction for all the genes that were starting to be sequenced in genome projects without any associated functional studies (at the time, I referred to all these ESTs and other sequences as an “onslaught” – little did I know what was to come).

My first paper on this topic was in 1995: Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions.  The abstract is below:

The SNF2 family of proteins includes representatives from a variety of species with roles in cellular processes such as transcriptional regulation (e.g. MOT1, SNF2 and BRM), maintenance of chromosome stability during mitosis (e.g. lodestar) and various aspects of processing of DNA damage, including nucleotide excision repair (e.g. RAD16 and ERCC6), recombinational pathways (e.g. RAD54) and post-replication daughter strand gap repair (e.g. RAD5). This family also includes many proteins with no known function. To better characterize this family of proteins we have used molecular phylogenetic techniques to infer evolutionary relationships among the family members. We have divided the SNF2 family into multiple subfamilies, each of which represents what we propose to be a functionally and evolutionarily distinct group. We have then used the subfamily structure to predict the functions of some of the uncharacterized proteins in the SNF2 family. We discuss possible implications of this evolutionary analysis on the general properties and evolution of the SNF2 family.



I note – I am annoyed that when I went to the Nucleic Acids Research site for my paper I discovered that, for some bizarre reason, they are now trying to charge for access to it even though it is in PubMed Central and used to be freely available on the NAR site. WTF? Is this just an IT issue like the #OpenGate complaints I made for a while about Nature Genome papers?

Anyway – in that paper in 1995 I basically showed that, at least for this family, phylogenetic analysis could be used as a tool in making functional predictions by allowing one to better identify orthology relationships and subfamilies within the SNF2 superfamily. I think this was at least somewhat novel, though others at the time were also looking into using various analyses to identify orthology relationships across genomes.

Shortly thereafter I started working on the concept that one could use the phylogenetic tree more explicitly in making functional predictions, and eventually I laid out the concept of treating function as a character state and doing character state reconstruction using a gene tree to then infer functions for uncharacterized genes. I called this approach “phylogenomics” in a paper in 1997 in Nature Medicine (the editor asked us to give it a name … and thus my own contribution to the omics word game began). Alas, somehow the title of our paper became “Gastrogenomic delights: a movable feast” since we were writing about the E. coli and H. pylori genomes, so I added yet another omics term at the same time. In the paper I showed how phylogenetic analysis of the MutS family of proteins could help in interpreting one of the findings in the H. pylori genome paper:

In this paper we showed why BLAST searches were not ideal for inferring relationships among sequences (because BLAST measures similarity, NOT evolutionary history per se). I am still a bit annoyed that other papers then sort of claimed they were the first to show BLAST was not ideal for inferring evolutionary relatedness, but whatever. This still did not fully describe the phylogeny driven approach that I was working on, so I then wrote up an outline of this approach for a paper in Genome Research: Phylogenomics: Improving Functional Prediction for Uncharacterized Genes by Evolutionary Analysis. This paper really laid out the idea in more detail:

It also gave detailed examples of how similarity searches could be misleading and how phylogenetic analysis should in principle be better.
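
To make the character state idea concrete, here is a minimal sketch of one way it could be done. It treats function as a discrete character and reconstructs it on a gene tree with Fitch parsimony, which is swapped in here purely for illustration and is not necessarily the procedure used in the papers above. The tree shape, the gene names (geneA to geneD), and the function labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    function: Optional[str] = None      # known function (characterized leaves only)
    states: Optional[Set[str]] = None   # candidate states after the upward pass
    inferred: Optional[str] = None      # final assignment after the downward pass


def upward(node: Node) -> Optional[Set[str]]:
    """Post-order Fitch pass: intersect child sets if possible, else take the union.

    Uncharacterized leaves contribute no information (their set is None).
    """
    if not node.children:
        node.states = {node.function} if node.function else None
        return node.states
    child_sets = [s for s in (upward(c) for c in node.children) if s is not None]
    if not child_sets:
        node.states = None
    else:
        common = set.intersection(*child_sets)
        node.states = common if common else set.union(*child_sets)
    return node.states


def downward(node: Node, parent_state: Optional[str] = None) -> None:
    """Pre-order pass: keep the parent's state when allowed, else pick one."""
    if node.states is None:
        node.inferred = parent_state             # no data below: inherit from parent
    elif parent_state in node.states:
        node.inferred = parent_state
    else:
        node.inferred = sorted(node.states)[0]   # arbitrary but deterministic tie-break
    for child in node.children:
        downward(child, node.inferred)


def demo() -> dict:
    # Hypothetical toy family: two subfamilies, each with one characterized
    # and one uncharacterized member.
    gene_a = Node("geneA", function="transcription")
    gene_b = Node("geneB")
    gene_c = Node("geneC", function="dna_repair")
    gene_d = Node("geneD")
    root = Node("root", [Node("subfamily1", [gene_a, gene_b]),
                         Node("subfamily2", [gene_c, gene_d])])
    upward(root)
    downward(root)
    return {leaf.name: leaf.inferred for leaf in (gene_a, gene_b, gene_c, gene_d)}


if __name__ == "__main__":
    print(demo())
```

In this toy case each uncharacterized gene picks up the function of the characterized gene in its own subfamily, regardless of which sequence might be its top similarity hit. A real analysis would also need to deal with branch lengths, duplication events, and genes with more than one function.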

I note – I am very very proud of this paper.  But it did not do a lot of things.  Really it was about laying out a concept of using tools from phylogenetics in functional prediction.  But it did not provide software for example.  I later developed some of my own scripts for doing this when I was at TIGR but really the software for phylogeny driven functional predictions would come later from others like Kimmen Sjolander, Sean Eddy, and Steven Brenner.  Each method laid out in these tools and in other papers had its own flavors and I continued to explore various approaches and applications to phylogeny driven functional prediction.  Examples of my subsequent work are listed below (with links to the Mendeley pages for these papers):

Plus we (at TIGR) used phylogenetic analysis as a tool in annotation of many many genomes as well as metagenomes.

Anyway, enough of history for a bit.  What is interesting about this new paper is that they take a slightly different approach to phylogeny driven functional prediction in that they make use of Gene Ontology functional annotations as their key parameter to trace on evolutionary trees.  They lay out the differences in their method quite well in the introduction:

Our general approach is similar to the ‘phylogenomic’ method proposed by Eisen [6] and further developed into a probabilistic form by Engelhardt et al. [7], but with important differences. Eisen proposed a conceptual approach for predicting protein function using a phylogenetic tree together with available experimental knowledge of proteins. The original approach relied on manual curation to identify gene duplication events and to find and assimilate the literature for characterized members of the family. Engelhardt et al. used automated reconciliation with the species tree [8] to identify gene duplication events, and experimental GO terms (MF only) to capture the experimental literature. Using this information, they defined a probabilistic model of evolution of MF involving transitions between different molecular functions.

From these previous studies, we adopt the basic approach of function evolution through a phylogenetic tree and the use of GO annotations to represent function. However, unlike these other phylogenomic methods, we represent the evolution in terms of discrete gain and loss events. In Eisen’s original model, an annotation does not necessarily represent a gain of function (it could have been inherited from an earlier ancestor), and losses are not explicitly annotated. The transition-based model of Engelhardt et al. assumes replacement of one function by another (gain of one function coupled to the loss of another), and does not capture uncoupled events, which is particularly important for BP annotations and cases where a protein has multiple molecular functions (see examples below). In addition, we make no a priori assumptions about conservation of function within versus between orthologous groups, or about the relationship between evolutionary distance and functional conservation (as the distance may not necessarily reflect every given function). While, as described below, gene duplication events and relatively long tree branches are important clues for curators to locate functional divergence (gain and/or loss), in our paradigm an ancestral function can be inherited by both descendants following a duplication (resulting in paralogs with the same function) or gained/lost by one descendant following a speciation event (resulting in orthologs with different functions). Evolution of each function is evaluated on a case-by-case basis, using many different sources of information about a given protein family
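
The gain and loss idea in the passage above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' PAINT software: the tree, the node names, and the labels function_X and function_Y are hypothetical, and the gain and loss events are simply supplied by hand, as a curator might supply them.

```python
def propagate(tree, events, node="root", inherited=frozenset()):
    """Push functions down a tree given discrete gain/loss events.

    tree:   {node_name: [child_names]}; nodes missing from the dict are leaves
    events: {node_name: (gains, losses)}, each a set of function labels
    Returns {leaf_name: set_of_inferred_functions}.
    """
    gains, losses = events.get(node, (set(), set()))
    current = (set(inherited) | gains) - losses   # gains and losses are uncoupled
    children = tree.get(node, [])
    if not children:
        return {node: current}
    result = {}
    for child in children:
        result.update(propagate(tree, events, child, frozenset(current)))
    return result


# Hypothetical family: a duplication at the root, then a speciation in each copy.
tree = {
    "root": ["copyA", "copyB"],
    "copyA": ["a_species1", "a_species2"],
    "copyB": ["b_species1", "b_species2"],
}
events = {
    "root": ({"function_X"}, set()),        # ancestral gain
    "a_species2": (set(), {"function_X"}),  # loss in one ortholog
    "b_species1": ({"function_Y"}, set()),  # extra gain, nothing is replaced
}
annotations = propagate(tree, events)
```

Here both paralogous lineages inherit function_X after the duplication, while the two orthologs in copyA end up with different annotations, which matches the two situations described in the quoted passage.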

I note – Paul Thomas, one of the authors here has also been developing phylogeny driven functional prediction methods for many years and has done some cool things previously.  This new approach seems novel and useful and their paper is worth looking at.  I like too that they focus on MutS homologs for some of their examples:

Anyway – their paper is worth a read and some of their software tools may be of use including PAINT: http://sourceforge.net/projects/pantherdb/ and http://pantree.org

Good to see continuous developments in phylogeny driven functional predictions.  If you want to learn more – check out the Mendeley Group I have created:

http://www.mendeley.com/groups/1190191/_/widget/29/5/

And please contribute to it. Below are some previous posts of mine of possible interest:

Some links on "ortholog conjecture" paper and critiques of it

Recently a paper by Matt Hahn was published in PLoS Computational Biology entitled “Testing the ortholog conjecture with comparative functional genomic data from mammals.”  The paper created a bit of a stir as some aspects of it call into question some of the standard assumptions made in comparative genomic analysis.

I alas do not have time to go into all the details but fortunately others have tackled this and I am posting some links here:

http://friendfeed.com/erickmatsen/f90bd2c6/emergentnexus-i-think-what-you-were-talking?embed=1

Will try to post my own comments soon.  I note – I am skeptical of their conclusions but still going through the paper to understand everything before commenting in more detail.

Testing, testing – why we need more testing like this in genomic informatics & annotation methods

Just got an announcement regarding this challenge:

Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations | Automated Function Prediction 2011 July 15-16 2011, Vienna, Austria

Here is a description:

CAFA is a community-driven effort. We call upon computational function prediction groups to predict the function of a set of proteins whose true function is sequestered. At the meeting, we will reveal the functions, and discuss the predictions. The CAFA challenge goals are to foster a discussion between annotators, predictors and experimentalists about the methodology and quality of functional predictions, as well as the methodology of assessing those predictions. Registration for CAFA starts July 15, 2010 and the CAFA challenge will take place September 15, 2010 through January 15, 2011. See here for more details on how you can enroll in CAFA.

This is near and dear to my heart as I have been working on methods to predict gene function from sequence for some 15 years now.  My first paper on this was in 1995 in which I showed that for genes in multigene families, phylogenetic trees of the gene family could help in predicting functions of uncharacterized members of the gene family.  More specifically, I suggested that the position of an uncharacterized gene in a gene tree relative to characterized genes could be used to predict its function.  I did this for one family in particular – the SNF2 family – but argued that it could be applied to other families.  (I think perhaps it was the first time someone had made this specific argument about using trees to predict function, but am not sure)

I then formalized this idea with a few papers (e.g., here and here) describing a “phylogenomic” approach to predicting function (alas, this is when I invented my first omics word). And for many years since, I have continued to work on functional prediction methods. When I was at TIGR for eight years I did this both in my own research and by helping others with their functional predictions. I firmly believe that evolutionary approaches are critical in such functional prediction and have laid this out in a series of talks and papers (e.g., see this more recent one).

Anyway, enough about me. I can argue all I want about how brilliant I am and about how evolutionary methods are the best approach. But arguing is alas not science. What we need are tests and experiments. And that is where things like CAFA come in. In CAFA one can test how well various functional prediction methods work. And the people involved in CAFA (including organizers Iddo Friedberg, Michal Linial, and Predrag Radivojac, and others such as Amos Bairoch, Sean Mooney, Patricia Babbitt, Steven Brenner, Christine Orengo and Burkhard Rost) are to be commended for putting this together because we do not have a lot of these activities and need more in all aspects of genomics (and metagenomics too). Others have discussed doing tests of functional prediction methods before, but I am not sure if any have happened per se.
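
For a rough sense of what such a test could look like in practice, here is a minimal sketch that scores predicted terms against held-back true terms using precision, recall, and the best F-measure over a range of score thresholds. The announcement does not say how predictions will be scored, so the metric, the data layout, and the protein and term names (P1, term_a, and so on) are all assumptions made for illustration.

```python
def precision_recall(predictions, truth, threshold):
    """Average per-protein precision and recall at one score threshold.

    predictions: {protein: {term: score}}   (hypothetical layout)
    truth:       {protein: set_of_true_terms}
    Precision is averaged over proteins with at least one prediction at or
    above the threshold; recall is averaged over all proteins with true terms.
    """
    precisions, recalls = [], []
    for protein, true_terms in truth.items():
        if not true_terms:
            continue  # nothing to evaluate against
        scored = predictions.get(protein, {})
        predicted = {term for term, score in scored.items() if score >= threshold}
        hits = len(predicted & true_terms)
        if predicted:
            precisions.append(hits / len(predicted))
        recalls.append(hits / len(true_terms))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall


def f_max(predictions, truth, thresholds):
    """Best harmonic mean of precision and recall over the given thresholds."""
    best = 0.0
    for threshold in thresholds:
        p, r = precision_recall(predictions, truth, threshold)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical toy data.
truth = {"P1": {"term_a", "term_b"}, "P2": {"term_c"}}
predictions = {
    "P1": {"term_a": 0.9, "term_b": 0.4, "term_d": 0.6},
    "P2": {"term_c": 0.8},
}
score = f_max(predictions, truth, [0.3, 0.5, 0.7, 0.85])
```

A single number like this hides a lot (for example, it ignores how close a wrong term is to the right one in the ontology), but it does give a common yardstick for comparing methods on the same targets.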

Have a favorite functional prediction method?  Enter it in the competition or give a talk on it.  And if you are feeling inspired, organize a similar activity in your area of science – testing is a good thing.

See also Iddo Friedberg’s post about this