Manuscript preprint now online – Phinch data visualization framework

The preprint for the Phinch software paper (one of my legacy Eisen Lab projects) is now online. Please enjoy the PDF on bioRxiv while we patiently wait for the manuscript to go through the peer review process:

Bik, H.M. and Pitch Interactive (2014) Phinch: An interactive, exploratory data visualization framework for –Omic datasets, bioRxiv, doi: http://dx.doi.org/10.1101/009944

If you’re not familiar with this project – Phinch (http://phinch.org) is an open-source framework for visualizing biological data, funded by a grant from the Alfred P. Sloan Foundation. This project has been an interdisciplinary collaboration between researchers (led by me at UC Davis) and Pitch Interactive, a data visualization studio in Oakland, CA. If you’re interested in loading up some data in this visualization tool, check out our GitHub wiki for full instructions on preparing files and metadata (if you’re already using the QIIME pipeline, you should be ready to go in ~10 minutes…we tried to make it that easy!)

New “data paper” from the lab on the draft genome of Tatumella sp. isolated from Drosophila suzukii larvae

See: Draft Genome Sequence of Tatumella sp. Strain UCD-D_suzukii (Phylum Proteobacteria) Isolated from Drosophila suzukii Larvae.

Congrats to all involved: Madison Dunitz, Pamela James, Guillaume Jospin, David Coil and Angus Chandler.  Getting data out there to the community and writing up a description of how the data was generated is an important part of open science.

A little bit about PhyloSift: phylogenetic analysis of genomes and metagenomes

New paper from people in the Eisen lab: PhyloSift: phylogenetic analysis of genomes and metagenomes [PeerJ].

Basically, the concept behind PhyloSift is to provide high-quality, automated, high-throughput phylogeny-driven analysis of metagenomic sequence data.  The software was developed openly on GitHub and has been available in some form for more than a year.  Aaron, Holly, Erick and I have discussed it extensively in various talks around the world and thus we assume some are already familiar with it.

This project was coordinated by Aaron Darling, who was a Project Scientist in my lab and is now a Professor at the University of Technology Sydney.  Also involved were Holly Bik (post doc in the lab), Guillaume Jospin (Bioinformatics Engineer in the lab), Eric Lowe (was a UC Davis undergrad working in the lab) and Erick Matsen (from the FHCRC).

Abstract:

Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection.

In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata.

These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).

Figure 1 shows the general outline of the workflow.
Figure 1 showing the Phylosift workflow.

The workflow follows a series of steps including

  • Sequence identity search 
  • Alignment to reference multiple alignment 
  • Placement on a phylogenetic reference tree 
  • Visual presentation of taxonomic summary 
  • Comparison among samples (e.g., using Edge PCA)

In addition, there is a workflow for updating the database behind PhyloSift which includes

  • Acquiring new genome data 
  • Gene family search and alignment workflow on each genome 
  • Phylogenetic inference and pruning 
  • Selection of representatives for similarity search 
  • Taxonomic reconciliation 

The paper shows some of the things you can do with PhyloSift and compares PhyloSift with some other methods.

Figure 2. Comparison of QIIME PCA and edge PCA analysis of human fecal samples.

Figure 3. Lineages contributing variation in human fecal sample community structure (analyzed using edge PCA).

It also provides Krona-based output visualization of the taxonomic composition of a sample.
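For a rough feel of what a hierarchical taxonomic summary involves, here is a minimal sketch in Python. It is not PhyloSift's code and does not produce Krona's input format; the function name `rollup_lineages` and the example lineages are made up for illustration. It assumes each read has already been assigned a lineage (broadest taxon first) and simply counts reads at every level of that lineage.

```python
from collections import Counter


def rollup_lineages(placements):
    """Count reads at every level of each lineage.

    `placements` is an iterable of lineages, one per read, each a tuple
    of taxon names ordered from broadest to most specific.
    (Hypothetical helper, not part of PhyloSift or Krona.)
    """
    counts = Counter()
    for lineage in placements:
        # A read assigned to a deep taxon also counts toward each ancestor.
        for depth in range(1, len(lineage) + 1):
            counts[lineage[:depth]] += 1
    return counts


# Hypothetical example data, not taken from the paper.
reads = [
    ("Bacteria", "Proteobacteria", "Tatumella"),
    ("Bacteria", "Proteobacteria"),
    ("Bacteria", "Cyanobacteria"),
]

summary = rollup_lineages(reads)
for lineage, n in sorted(summary.items()):
    print(" > ".join(lineage), n)
```

Keying the counts on lineage prefixes means a read placed only at the phylum level still contributes to its phylum and domain totals, which is the kind of nested tally a hierarchical chart needs.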

Anyway, more on PhyloSift later.  Just thought I would get something out here on the blog.  Thanks to Aaron Darling, Holly Bik, Guillaume Jospin, Eric Lowe and Erick Matsen for all their hard work on this.  And thanks to the Department of Homeland Security for supporting the work.

For more about Phylosift see

New EisenLab paper: PhyloSift: phylogenetic analysis of genomes and metagenomes [PeerJ]

New paper from people in the Eisen lab (and some others): PhyloSift: phylogenetic analysis of genomes and metagenomes [PeerJ].  This project was coordinated by Aaron Darling, who was a Project Scientist in my lab and is now a Professor at the University of Technology Sydney.  Also involved were Holly Bik (post doc in the lab), Guillaume Jospin (Bioinformatics Engineer in the lab), Eric Lowe (was a UC Davis undergrad working in the lab) and Erick Matsen (from the FHCRC).  This work was supported by a grant from the Department of Homeland Security.

Abstract:

Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection.

In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata.

These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).

For more about Phylosift see

New paper from Eisen lab: Genomic Encyclopedia of Type Strains, Phase I: the one thousand microbial genomes KMG-I project

A new paper of possible interest, discussing one of the new phases of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) project: Genomic Encyclopedia of Type Strains, Phase I: the one thousand microbial genomes KMG-I project | Kyrpides | Standards in Genomic Sciences.

 

New paper from some in the Eisen lab: phylogeny driven sequencing of cyanobacteria

Quick post here.  This paper came out a few months ago but it was not freely available so I did not write about it (it is in PNAS but was not published with the PNAS Open Option; that was not my choice, as the lead author did not choose that option and I was not really in the loop when that choice was made).

Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. [Proc Natl Acad Sci U S A. 2013] – PubMed – NCBI.

Anyway – it is now in PubMed Central and at least freely available so I felt OK posting about it now.  It is in a way a follow-up to the “A phylogeny driven genomic encyclopedia of bacteria and archaea” paper (AKA GEBA) from 2009, with this paper zooming in on the cyanobacteria.

 

New papers from people in the lab …

A lot of new papers from people in the lab in the last month or so.  See below for details of some of them:

Nice timing: Our paper on the Darwin’s Finch genome is out today on Darwin’s birthday

Birthday party for Darwin in 2009

Well, I assume this was on purpose from the folks at BioMed Central but I am not sure.  Our paper on the genome of one of Darwin’s Finches is out today in BMC Genomics: BMC Genomics | Abstract | Insights into the evolution of Darwin’s finches from comparative analysis of the Geospiza magnirostris genome sequence.

Abstract of the paper:

Background
A classical example of repeated speciation coupled with ecological diversification is the evolution of 14 closely related species of Darwin’s (Galápagos) finches (Thraupidae, Passeriformes). Their adaptive radiation in the Galápagos archipelago took place in the last 2–3 million years and some of the molecular mechanisms that led to their diversification are now being elucidated. Here we report evolutionary analyses of genome of the large ground finch, Geospiza magnirostris.
Results
13,291 protein-coding genes were predicted from a 991.0 Mb G. magnirostris genome assembly. We then defined gene orthology relationships and constructed whole genome alignments between the G. magnirostris and other vertebrate genomes. We estimate that 15% of genomic sequence is functionally constrained between G. magnirostris and zebra finch. Genic evolutionary rate comparisons indicate that similar selective pressures acted along the G. magnirostris and zebra finch lineages suggesting that historical effective population size values have been similar in both lineages. 21 otherwise highly conserved genes were identified that each show evidence for positive selection on amino acid changes in the Darwin’s finch lineage. Two of these genes (Igf2r and Pou1f1) have been implicated in beak morphology changes in Darwin’s finches. Five of 47 genes showing evidence of positive selection in early passerine evolution have cilia related functions, and may be examples of adaptively evolving reproductive proteins.
Conclusions
These results provide insights into past evolutionary processes that have shaped G. magnirostris genes and its genome, and provide the necessary foundation upon which to build population genomics resources that will shed light on more contemporaneous adaptive and non-adaptive processes that have contributed to the evolution of the Darwin’s finches.

Figure 1

There is a long long long story behind this paper.  Too long for me to write up right now.  I wrote up some of the story for a Figshare posting of the genome data last year.

“Darwin’s Finches” are a model system for the study of various aspects of evolution and development.  In 2008 we commenced a project to sequence the genomes of some of these species – inspired by the (then) upcoming celebration of the 200th anniversary of the birth of Charles Darwin (which was in February 2009).  The project started with a brief discussion at the AGBT meeting in 2008 and then via an email conversation between Jonathan Eisen and Jason Affourtit about the possibility of a collaboration involving the 454 company (which was looking for projects to highlight the power of its then relatively new 454 sequencing machines).  After further discussions between Jonathan Eisen, his brother Michael Eisen (who separately had become interested in Darwin’s finches) and people from 454 it was decided that this was a potentially good project for a scientific and marketing collaboration.

In these conversations it was determined that the most likely limiting factor would be access to DNA from the finches.  This was largely because the Galapagos Islands (where the finches reside) are a National Park in Ecuador and also a World Heritage site.  Collection of samples there for any type of research is highly regulated.  Thus, Jonathan Eisen made contact with Peter and Rosemary Grant – the most prominent researchers working on the finches – with whom Eisen had discussed sequencing the finch genomes in the early 2000s.  In that previous conversation it was determined that the sequencing would be too expensive to carry out without a major fundraising effort.  However, with the advent of “next generation” sequencing methods such as 454 the total costs of such a project would be much lower.

In the conversations with the Grants, the Grants offered to ask around to see if anyone had sufficient amounts of DNA (or access to samples), which would be needed for genome library construction.  Subsequently they identified Arhat Abzhanov from Harvard as someone who likely had samples from many of the finch species, as well as permission to do DNA-based work on them. Abzhanov offered to provide samples from three key species (large ground finch Geospiza magnirostris, large cactus finch G. conirostris and sharp-billed finch G. difficilis) and DNA was sent to Roche-454 for sequencing in July of 2008.  In August, the first “test” sequence data was provided from Geospiza magnirostris.  A plan was then made to generate additional data and Roche offered to do the sequencing at their center at a steep discount.  Funds were raised by Jonathan Eisen, Greg Wray, Monica Riley, and others to pay for the sequencing and over the next year or so, three sequencing bursts were conducted at Roche-454.

That is a decent summary of the background.  The details on the science are in the paper.  What the background does not say is that the project languished for years as we did not have funds to support the actual analysis of the genomes and it was kind of out of my normal area of expertise.  Along the way, I did a poor job of communicating with some of the initial parties in the project (e.g., I did a really bad job of communicating with Greg Wray – who had provided some of the funds – and I will forever be trying to make things up to him).  Anyway, thankfully Arhat eventually pulled together a group of people led by Chris Ponting to help analyze the genome and Chris led the way to the paper that is out today.  Only four years after our original goal.

I have been a birder and an evolutionary biologist for many many many years. Thus this is kind of a cool project for me.  When I was in the Galapagos in 2002 I dreamed of doing a project like this – and even started doodling Darwin’s finches all over the place – including on some of the styrofoam cups we sent down to the bottom of the ocean on the outside of the Alvin sub as part of a deep sea research cruise I went on.  See below:

Some related posts:
From 2002

Me, in the Galapagos in 2002

Email from BioMed Central pointing to ways to get #altmetrics for recent sFAMS paper

Just received this from BioMed Central and thought some people might be interested in some of the ways they try to help you gather metrics about your papers.

Dear Prof Eisen,

We thought you might be interested to know how many people have read your article:

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource
Thomas J Sharpton, Guillaume Jospin, Dongying Wu, Morgan GI Langille, Katherine S Pollard and Jonathan A Eisen
BMC Bioinformatics, 13:264   (13 Oct 2012)
http://www.biomedcentral.com/1471-2105/13/264
