Machine learning methods for protein function and structure prediction
Kimmen Sjölander
Associate Professor
Berkeley Phylogenomics Group
University of California, Berkeley
http://phylogenomics.berkeley.edu
May 1, 2012, 12:00 p.m.
1005 GBSF Auditorium
Abstract: Theodosius Dobzhansky, the noted geneticist and evolutionary biologist, is famous for having said “Nothing in biology makes sense except in the light of evolution.” In this talk, I will discuss the explicit use of evolution as a fundamental principle in bioinformatics, using machine learning methods in combination with information from protein structure and evolution to improve the power and specificity of a number of bioinformatics tasks, including prediction of protein structure and function, ortholog identification, functional site prediction, and simultaneous estimation of multiple sequence alignments and protein superfamily phylogenies. Because many of these methods require expertise and/or computational resources not available to most experimental biologists, we provide pre-calculated phylogenetic trees for gene families in the PhyloFacts database. PhyloFacts 3.0 is a phylogenomic database of gene families across the Tree of Life. Each PhyloFacts family contains a multiple sequence alignment, a phylogenetic tree, predicted orthologs, predicted pathway associations, and experimental and other annotation data. As of April 26, 2012, PhyloFacts 3.0 contains more than 7.3 million protein sequences from more than 99,000 unique taxa (including strains) across more than 92,000 families.
Finally, I will describe our work on a fully automated system for high-throughput functional annotation of genomes and for taxonomic and functional annotation of metagenome (environmental sample) datasets. This system, which we call FAT-CAT (for Fast Approximate Tree Classification), uses hidden Markov models placed at internal nodes of PhyloFacts trees to classify sequences to different levels of functional hierarchies. Subtree nodes are annotated automatically using data available for sequences descending from those nodes, allowing both functional and taxonomic inference for sequences classified to those nodes. The PhyloFacts Phylogenomic Database is available at http://phylogenomics.berkeley.edu/phylofacts/.
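
For readers who want a concrete picture of classifying a sequence to a node in a tree, the following is a minimal toy sketch in Python. It is not the FAT-CAT implementation. It assumes a simple per-column frequency profile as a crude stand-in for a hidden Markov model (no insert or delete states), a fixed score threshold, and pre-aligned sequences of equal length. The names Node, profile_score, and classify, the labels, and the example sequences are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


@dataclass
class Node:
    label: str          # annotation attached to this subtree (hypothetical)
    members: List[str]  # aligned, equal-length sequences descending from the node
    children: List["Node"] = field(default_factory=list)


def profile_score(members: List[str], query: str, pseudocount: float = 1.0) -> float:
    """Log-odds (bits) of the query under a per-column frequency profile.

    This replaces a real profile HMM with the simplest possible model.
    """
    total = 0.0
    for col, residue in enumerate(query):
        count = sum(1 for seq in members if seq[col] == residue)
        prob = (count + pseudocount) / (len(members) + pseudocount * len(ALPHABET))
        total += math.log2(prob / BACKGROUND)
    return total


def classify(root: Node, query: str, threshold: float) -> List[str]:
    """Return labels from the root down to the deepest node scoring >= threshold."""
    if profile_score(root.members, query) < threshold:
        return []
    path = [root.label]
    node = root
    while node.children:
        # Follow the best-scoring child; stop when no child passes the threshold.
        best = max(node.children, key=lambda c: profile_score(c.members, query))
        if profile_score(best.members, query) < threshold:
            break
        path.append(best.label)
        node = best
    return path


# Toy tree: one family with two subfamilies (made-up sequences).
sub_a = Node("subfamily A", ["ACDE", "ACDF"])
sub_b = Node("subfamily B", ["GHIK", "GHIL"])
family = Node("family", sub_a.members + sub_b.members, [sub_a, sub_b])

print(classify(family, "ACDE", threshold=4.0))  # close match to subfamily A
print(classify(family, "ACIK", threshold=4.0))  # mixes features of both subfamilies
print(classify(family, "WWWW", threshold=4.0))  # unrelated sequence
```

In this sketch, a query that matches one subtree well is assigned the labels of every node on the path to that subtree, while a query that only resembles the family as a whole stops at the family node and inherits only the family-level annotation.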