New paper: MicrobeDB: a locally maintainable database of microbial genomic sequences

New paper out involving the lab.  The lead author is Morgan Langille, who was a postdoc in the lab and is now at Dalhousie University.  The paper describes a tool for creation and maintenance of a local genome sequence database. See MicrobeDB: a locally maintainable database of microbial genomic sequences.

Software is available at


New #openaccess paper in G3 from my lab w/ many others on ‘Programmed DNA elimination in Tetrahymena’ #CiliatesRule

A new paper in which the lab was involved has been published recently (just found it though it is not in Pubmed yet): Genome-Scale Analysis of Programmed DNA Elimination Sites in Tetrahymena thermophila.  It was a collaboration between Kathy Collins, multiple Tetrahymena researchers, the Eisen lab, and the UC Davis Genome Center Bioinformatics core (Joseph Fass and Dawei Lin).  The paper is in G3, an open access journal from the Genetics Society. 

This stems from the project I coordinated on the sequencing and analysis of the macronuclear genome of the single-celled ciliate Tetrahymena thermophila.  This organism, like other ciliates, has two nuclei – one called the micronucleus and one called the macronucleus.  In essence you can view the micronucleus as the germ line for this single-celled creature, and the macronucleus is akin to somatic cells.  The micronucleus is reserved mostly for reproduction, and the macronucleus is used for gene expression.  In sexual reproduction, haploid versions of the micronuclear genomes from two lineages merge together just like in sexual reproduction for other eukaryotes.  After sex the offspring then create a macronuclear genome by taking the micronuclear genome and processing it in a variety of ways – going from 5 chromosomes, for example, to hundreds.  Plus many regions of the micronuclear genome are “spliced” out and never make it into the macronuclear genome.  Our new paper focuses on trying to better characterize which regions of the micronuclear genome get eliminated.

For more on our past work on Tetrahymena genomics see here which includes links to much more information including to my 2006 blog post about our first paper on the project.

Note – the work in my lab on the sequencing was supported by grants from NSF and NIH-NIGMS.

New paper from lab: Genome-Scale Analysis of Programmed DNA Elimination Sites in Tetrahymena thermophila

A new paper in which the lab was involved has been published recently (just found it though it is not in Pubmed yet): Genome-Scale Analysis of Programmed DNA Elimination Sites in Tetrahymena thermophila.  It was a collaboration between Kathy Collins, multiple Tetrahymena researchers, the Eisen lab, and the UC Davis Genome Center Bioinformatics core.  The paper is in G3, an open access journal from the Genetics Society.

New publication from members of my lab (e.g., @ryneches) & lab of Marc Facciotti on ChIP-seq based mapping of archaeal transcription factors

New publication from members of my lab and the lab of Marc Facciotti on a workflow for ChIP-seq based mapping of archaeal transcription factors. The paper includes a description of new software from Russell Neches in my lab called pique for peak calling.

See: A workflow for genome-wide mapping of archaeal transcription factors with ChIP-seq

Russell’s pique software is available on github here:
The Pique software package processes ChIP-seq coverage data to predict protein-binding sites. Strand-specific coverage data are output as tracks for the Gaggle Genome Browser, and putative-binding sites (peaks) are output as ‘bookmark files’. (A) Screenshot of data browsing in the Gaggle Genome Browser. Green box outlines the navigation window for clicking through bookmarks of predicted binding sites. Details of each site can be displayed (inset). The Gaggle toolbar (shown with black arrow) can be used to broadcast selected data to other ‘geese’ in the gaggle package, programs such as R, cytoscape, BLAST or KEGG. (B) Schematic overview of bioinformatics workflow.

Wilbanks, E., Larsen, D., Neches, R., Yao, A., Wu, C., Kjolby, R., & Facciotti, M. (2012). A workflow for genome-wide mapping of archaeal transcription factors with ChIP-seq. Nucleic Acids Research DOI: 10.1093/nar/gks063

New paper: PLoS ONE: Accounting For Alignment Uncertainty in Phylogenomics

A new paper is out from the lab: PLoS ONE: Accounting For Alignment Uncertainty in Phylogenomics.

It describes the “Zorro” software for automated alignment masking.


Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning confidence scores to each alignment position. Using the BALIBASE database and in simulation studies, we demonstrate that masking by ZORRO significantly reduces the alignment uncertainty and improves the tree accuracy.
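To make the masking idea concrete, here is a minimal sketch of what happens once each alignment column has a confidence score. It is not ZORRO itself: the function name, the toy alignment, the scores, and the threshold of 5.0 are all made up for illustration, and how the scores are computed is left out entirely.

```python
def mask_alignment(alignment, scores, threshold):
    """Keep only the alignment columns whose confidence score meets the threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    scores:    one confidence score per alignment column (hypothetical values)
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(scores)}:
        raise ValueError("every sequence must have exactly one score per column")
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy data, invented for this example.
alignment = {
    "seqA": "MK-LVT",
    "seqB": "MKALIT",
    "seqC": "MR-LVS",
}
scores = [9.5, 8.0, 1.2, 9.0, 6.5, 3.0]

masked = mask_alignment(alignment, scores, threshold=5.0)
print(masked)
```

The masked alignment is what would then be passed to tree inference; where to set the threshold is a judgment call that this sketch does not address.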

This work was supported by the Gordon and Betty Moore Foundation.

Assemblathon 1 paper out, includes many #UCDavis folks, though @vsbuffalo name backwards

Quick one here. A new paper is out from many folks, including Aaron Darling from my lab as well as a few other UC Davis folks. It is a cool paper: Assemblathon 1: A competitive assessment of de novo short read assembly methods
One minor issue – seems they got Vince Buffalo’s name backwards – he is listed as Buffalo Vince on the Genome Research page, though in the PDF they have his name correct. Will have to see what he has to say about that.

New publication from the lab: Assemblathon 1: A competitive assessment of de novo short read assembly methods

Aaron Darling from the lab is an author on a new paper just published: Assemblathon 1: A competitive assessment of de novo short read assembly methods.

More on ‘phylogenomics’ – as in functional prediction w/ phylogeny

There is a new paper out: Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium in Briefings in Bioinformatics.

The paper is interesting and presents a new general approach to using phylogeny for functional prediction of uncharacterized genes. I am interested in this for many reasons including that I was one of the first, if not the first, to lay this out as a concept.  In a series of papers from 1995-1998 I outlined how phylogenetic analysis could be used to aid in functional prediction for all the genes that were starting to be sequenced in genome projects without any associated functional studies (at the time, I referred to all these ESTs and other sequences as an “onslaught” – little did I know what was to come).

My first paper on this topic was in 1995: Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions.  The abstract is below:

The SNF2 family of proteins includes representatives from a variety of species with roles in cellular processes such as transcriptional regulation (e.g. MOT1, SNF2 and BRM), maintenance of chromosome stability during mitosis (e.g. lodestar) and various aspects of processing of DNA damage, including nucleotide excision repair (e.g. RAD16 and ERCC6), recombinational pathways (e.g. RAD54) and post-replication daughter strand gap repair (e.g. RAD5). This family also includes many proteins with no known function. To better characterize this family of proteins we have used molecular phylogenetic techniques to infer evolutionary relationships among the family members. We have divided the SNF2 family into multiple subfamilies, each of which represents what we propose to be a functionally and evolutionarily distinct group. We have then used the subfamily structure to predict the functions of some of the uncharacterized proteins in the SNF2 family. We discuss possible implications of this evolutionary analysis on the general properties and evolution of the SNF2 family.

I note – I am annoyed that when I went to the Nucleic Acids Research site for my paper I discovered that for some bizarre reason they are now trying to charge for access to it even though it is in Pubmed Central and used to be freely available on the NAR site.  WTF?  Is this just an IT issue like the #OpenGate complaints I made for a while about Nature Genome papers?

Anyway – in that paper in 1995 I basically showed that, at least for this family, phylogenetic analysis could be used as a tool in making functional predictions by allowing one to better identify orthology relationships and subfamilies within the SNF2 superfamily.  I think this was at least a little bit novel, but others at the time were also looking into using various analyses to identify orthology relationships across genomes.

Shortly thereafter I started working on the concept that one could use the phylogenetic tree more explicitly in making functional predictions and eventually I laid out the concept of treating function as a character state and doing character state reconstruction using a gene tree to then infer functions for uncharacterized genes.  I called this approach “phylogenomics” in a paper in 1997 in Nature Medicine (the editor asked us to give it a name … and thus my own contribution to the omics word game began).  Alas somehow the title of our paper became “Gastrogenomic delights: a movable feast” since we were writing about the E. coli and H. pylori genomes, so I added yet another omics term at the same time.  In the paper I showed how phylogenetic analysis of the MutS family of proteins could help in interpreting one of the findings in the H. pylori genome paper:

In this paper we showed why blast searches were not ideal for inferring relationships among sequences (because blast measures similarity NOT evolutionary history per se).  A bit annoyed still that other papers then sort of claimed they were the first to show blast was not ideal for inferring evolutionary relatedness, but whatever. This still did not fully describe the phylogeny driven approach that I was working on so I then wrote up an outline of this approach for a paper in Genome Research: Phylogenomics: Improving Functional Prediction for Uncharacterized Genes by Evolutionary Analysis.  This paper really laid out the idea in more detail:

It also gave detailed examples of how similarity searches could be misleading and how phylogenetic analysis should in principle be better.
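For readers who have not seen character state reconstruction before, here is a rough sketch of the idea of treating function as a character on a gene tree. It uses simple Fitch parsimony, which I am using only as an easy-to-read stand-in and not as a claim about what any of the papers implemented. The gene names, the functions, and the nested-tuple tree format are all invented for the example.

```python
def fitch_predict(tree, known):
    """Predict functions for unannotated leaves of a binary gene tree.

    tree:  nested 2-tuples with leaf names as strings (hypothetical format)
    known: dict mapping annotated leaf name -> function label
    """
    if not known:
        raise ValueError("at least one annotated gene is required")
    all_states = frozenset(known.values())

    def up(node):
        # Bottom-up pass: attach a candidate state set to every node.
        # Unannotated leaves are treated as missing data (any state allowed).
        if isinstance(node, str):
            states = frozenset([known[node]]) if node in known else all_states
            return (node, states)
        left, right = up(node[0]), up(node[1])
        shared = left[-1] & right[-1]
        return (left, right, shared if shared else left[-1] | right[-1])

    predictions = {}

    def down(node, parent_state):
        # Top-down pass: keep the parent's state when it is a candidate,
        # otherwise pick one candidate (alphabetical, an arbitrary tie-break).
        states = node[-1]
        state = parent_state if parent_state in states else sorted(states)[0]
        if isinstance(node[0], str):
            if node[0] not in known:
                predictions[node[0]] = state
            return
        down(node[0], state)
        down(node[1], state)

    down(up(tree), None)
    return predictions


# Invented example: geneB and geneE have no experimental annotation.
tree = (("geneA", "geneB"), ("geneC", ("geneD", "geneE")))
known = {
    "geneA": "mismatch repair",
    "geneC": "recombination",
    "geneD": "recombination",
}
predictions = fitch_predict(tree, known)
print(predictions)
```

The point of the sketch is that each prediction comes from where a gene sits in the tree rather than from which annotated sequence happens to be its top similarity hit.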

I note – I am very very proud of this paper.  But it did not do a lot of things.  Really it was about laying out a concept of using tools from phylogenetics in functional prediction.  But it did not provide software for example.  I later developed some of my own scripts for doing this when I was at TIGR but really the software for phylogeny driven functional predictions would come later from others like Kimmen Sjolander, Sean Eddy, and Steven Brenner.  Each method laid out in these tools and in other papers had its own flavors and I continued to explore various approaches and applications to phylogeny driven functional prediction.  Examples of my subsequent work are listed below (with links to the Mendeley pages for these papers):

Plus we (at TIGR) used phylogenetic analysis as a tool in annotation of many many genomes as well as metagenomes.

Anyway, enough of history for a bit.  What is interesting about this new paper is that they take a slightly different approach to phylogeny driven functional prediction in that they make use of Gene Ontology functional annotations as their key parameter to trace on evolutionary trees.  They lay out the differences in their method quite well in the introduction:

Our general approach is similar to the ‘phylogenomic’ method proposed by Eisen [6] and further developed into a probabilistic form by Engelhardt et al. [7], but with important differences. Eisen proposed a conceptual approach for predicting protein function using a phylogenetic tree together with available experimental knowledge of proteins. The original approach relied on manual curation to identify gene duplication events and to find and assimilate the literature for characterized members of the family. Engelhardt et al. used automated reconciliation with the species tree [8] to identify gene duplication events, and experimental GO terms (MF only) to capture the experimental literature. Using this information, they defined a probabilistic model of evolution of MF involving transitions between different molecular functions.

From these previous studies, we adopt the basic approach of function evolution through a phylogenetic tree and the use of GO annotations to represent function. However, unlike these other phylogenomic methods, we represent the evolution in terms of discrete gain and loss events. In Eisen’s original model, an annotation does not necessarily represent a gain of function (it could have been inherited from an earlier ancestor), and losses are not explicitly annotated. The transition-based model of Engelhardt et al. assumes replacement of one function by another (gain of one function coupled to the loss of another), and does not capture uncoupled events, which is particularly important for BP annotations and cases where a protein has multiple molecular functions (see examples below). In addition, we make no a priori assumptions about conservation of function within versus between orthologous groups, or about the relationship between evolutionary distance and functional conservation (as the distance may not necessarily reflect every given function). While, as described below, gene duplication events and relatively long tree branches are important clues for curators to locate functional divergence (gain and/or loss), in our paradigm an ancestral function can be inherited by both descendants following a duplication (resulting in paralogs with the same function) or gained/lost by one descendant following a speciation event (resulting in orthologs with different functions). Evolution of each function is evaluated on a case-by-case basis, using many different sources of information about a given protein family
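A toy sketch may help show what representing evolution as discrete gain and loss events means in practice. This is my own illustration and not the authors' PAINT software: the tree format, node names, and function labels are all made up. A function gained at a node is inherited by everything below it until some descendant records a loss.

```python
def propagate(tree, events, inherited=frozenset()):
    """Return {leaf name: set of functions} implied by gain/loss events.

    tree:   (node name, list of child trees); leaves have an empty child list
    events: dict mapping node name -> (gained functions, lost functions)
    """
    name, children = tree
    gains, losses = events.get(name, ((), ()))
    current = (inherited | frozenset(gains)) - frozenset(losses)
    if not children:
        return {name: set(current)}
    result = {}
    for child in children:
        result.update(propagate(child, events, current))
    return result


# Invented family: a duplication gives geneA and geneB; geneC is their ortholog.
tree = ("root", [
    ("duplication", [("geneA", []), ("geneB", [])]),
    ("geneC", []),
])
events = {
    "root": (["DNA binding"], []),       # gained in the common ancestor
    "geneB": (["ATPase activity"], []),  # uncoupled gain in one paralog
    "geneC": ([], ["DNA binding"]),      # loss after a speciation event
}
functions = propagate(tree, events)
print(functions)
```

Here both paralogs keep the ancestral function after the duplication, one of them gains a second function without losing the first, and an ortholog loses the function, which are the kinds of cases the quoted passage says a replacement-only model does not capture.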

I note – Paul Thomas, one of the authors here has also been developing phylogeny driven functional prediction methods for many years and has done some cool things previously.  This new approach seems novel and useful and their paper is worth looking at.  I like too that they focus on MutS homologs for some of their examples:

Anyway – their paper is worth a read and some of their software tools may be of use including PAINT: and

Good to see continuous developments in phylogeny driven functional predictions.  If you want to learn more – check out the Mendeley Group I have created:

And please contribute to it. Below are some previous posts of mine of possible interest:

My first PLoS One paper …. yay: automated phylogenetic tree based rRNA analysis
Well, I have truly entered the modern world. My first PLoS One paper has just come out. It is entitled “An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP)” and well, it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for “manual” work in analyzing rRNAs.

The work was done primarily by Dongying Wu, a Project Scientist in my lab, with assistance from Amber Hartman, who is a PhD student in my lab. Naomi Ward, who was on the faculty at TIGR and is now at Wyoming, and I helped guide the development and testing of the software.

We first developed this pipeline/software in conjunction with analyzing the rRNA sequences that were part of the Sargasso Sea metagenome, and results from the work were in the Venter et al. Sargasso paper. We then used the pipeline and continued to refine it as part of a variety of studies including a paper by Kevin Penn et al. on coral associated microbes. Kevin was working as a technician for me and Naomi and is now a PhD student at Scripps Institution of Oceanography. We also had some input from various scientists we were working with on rRNA analyses, especially Jen Hughes Martiny.

We made a series of further refinements and worked with people like Saul Kravitz from the Venter Institute and the CAMERA metagenomics database to make sure that the software could be run outside of my lab. And then we finally got around to writing up a paper …. and now it is out.

You can download the software here. The basics of the software are summarized below: (see flow chart too).

  • Stage 1: Domain Analysis
    • Take a rRNA sequence
    • blast it against a database of representative rRNAs from all lines of life
    • use the blast results to help choose sequences to use to make a multiple sequence alignment
    • infer a phylogenetic tree from the alignment
    • assign the sequence to a domain of life (bacteria, archaea, eukaryotes)

  • Stage 2: First pass alignment and tree within domain
    • take the same rRNA sequence
    • blast against a database of rRNAs from within the domain of interest
    • use the blast results to help choose sequences for a multiple alignment
    • infer a phylogenetic tree from the alignment
    • assign the sequence to a taxonomic group

  • Stage 3: Second pass alignment and tree within domain
    • extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)
    • make a multiple sequence alignment
    • infer a phylogenetic tree

From the above path, we end up with an alignment, which is useful for things such as counting the number of species in a sample, as well as a tree, which is useful for determining what types of organisms are in the sample.
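The three stages can be sketched as plain control flow. This is only an outline of the steps as described in this post, not the STAP code: the `Tools` container, its field names, and the toy lambdas are hypothetical stand-ins for the real BLAST, alignment, and tree inference programs, and Stage 3 here leaves out the extra sequences that are added to balance the diversity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Tools:
    """Hypothetical stand-ins for the external programs each stage calls."""
    search: Callable[[str, str], Sequence[str]]      # (query, database) -> hits
    align: Callable[[Sequence[str]], Sequence[str]]  # sequences -> alignment
    build_tree: Callable[[Sequence[str]], object]    # alignment -> tree
    classify: Callable[[object, str], str]           # (tree, query) -> label
    group_members: Callable[[str], Sequence[str]]    # group -> member sequences


def classify_rrna(query, tools):
    # Stage 1: place the query among representatives from all domains of life.
    hits = tools.search(query, "all_domains")
    tree = tools.build_tree(tools.align([query, *hits]))
    domain = tools.classify(tree, query)

    # Stage 2: repeat within the assigned domain to get a taxonomic group.
    hits = tools.search(query, domain)
    tree = tools.build_tree(tools.align([query, *hits]))
    group = tools.classify(tree, query)

    # Stage 3: realign with members of that group and build the final tree.
    alignment = tools.align([query, *tools.group_members(group)])
    return domain, group, alignment, tools.build_tree(alignment)


# Toy stand-ins so the control flow can be run end to end.
toy = Tools(
    search=lambda query, database: [f"{database}_ref1", f"{database}_ref2"],
    align=lambda seqs: list(seqs),
    build_tree=lambda alignment: tuple(alignment),
    classify=lambda tree, query: (
        "bacteria" if "all_domains_ref1" in tree else "proteobacteria"
    ),
    group_members=lambda group: [f"{group}_member1"],
)

domain, group, alignment, tree = classify_rrna("ACGU", toy)
print(domain, group)
```

Passing the tools in as callables keeps the staging logic separate from whichever search, alignment, and tree programs are actually used.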

I note – the key is that it is completely automated, can be run on a single machine or a cluster, and produces results comparable to manual methods. In the long run we plan to connect this to other software that we and other labs develop to build a metagenomics and microbial diversity workflow that will help in the processing of massive amounts of sequence data for microbial diversity studies.

I should note this work was supported primarily by a National Science Foundation grant to me and Naomi Ward as part of their “Assembling the Tree of Life” Program (Grant No. 0228651). Some final work on the project was funded by the Gordon and Betty Moore Foundation through grant #1660 to Jonathan Eisen and the CAMERA grant to UCSD.

Wu, D., Hartman, A., Ward, N., & Eisen, J. (2008). An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP) PLoS ONE, 3 (7) DOI: 10.1371/journal.pone.0002566