New EisenLab paper: PhyloSift: phylogenetic analysis of genomes and metagenomes [PeerJ]

New paper from people in the Eisen lab (and some others): PhyloSift: phylogenetic analysis of genomes and metagenomes [PeerJ].  This project was coordinated by Aaron Darling, who was a Project Scientist in my lab and is now a Professor at the University of Technology Sydney.  Also involved were Holly Bik (post doc in the lab), Guillaume Jospin (Bioinformatics Engineer in the lab), Eric Lowe (was a UC Davis undergrad working in the lab) and Erick Matsen (from the FHCRC).  This work was supported by a grant from the Department of Homeland Security.


Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection.

In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata.

These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).

For more about Phylosift see

Mini journal club: staged phage attack on a humanized mouse microbiome

Doing another mini journal club here.  Just got notified of this paper through some automated Google Scholar searches: Gnotobiotic mouse model of phage–bacterial host dynamics in the human gut

Full citation: Reyes, A., Wu, M., McNulty, N. P., Rohwer, F. L., & Gordon, J. I. (2013). Gnotobiotic mouse model of phage–bacterial host dynamics in the human gut. Proceedings of the National Academy of Sciences, 201319470.

The paper seems pretty fascinating at first glance. Basically they built on the Jeff Gordon germ free mouse model and introduced a defined set of cultured microbes that came from humans.  And then they staged a phage attack on the system and monitored the response of the community to the phage attack.

Figure 1 from Reyes et al.

They (of course) also did a control – in this case with heat killed phage.  And they compared what happened to the live phage.  I love this concept as they are able to control the microbial community and then test dynamics of how specific phage affect that community inside a living host.  Very cool.

Who are the microbes in your neighborhood? Quite a few are from Melainabacteria – a new phylum sister to Cyanobacteria

I just love this paper … The human gut and groundwater harbor non-photosynthetic bacteria belonging to a new candidate phylum sibling to Cyanobacteria | eLife from the labs of Ruth Ley and Jill Banfield (the co-first authors are Sara C. Di Rienzi and Itai Sharon).  It represents a landmark study in something that has intrigued many microbial diversity / human microbiome researchers for many years.  Early in the history of sequencing rRNA genes from human microbiome samples, researchers discovered something a bit weird – quite a few sequences were coming from what appeared to be close relatives of Cyanobacteria.  This was weird because all known Cyanobacteria were thought to be photosynthetic and – well – there is not too much light in the human gut.

Now – one possible explanation for this was that these sequences were coming from photosynthetic bacteria but these bacteria were not residents of the human gut but came via consumable items (i.e., food and drink).  Perhaps they were actually from chloroplasts of something in the diet (after all – chloroplasts are derived versions of cyanobacteria). This idea was discussed at many meetings I attended.  But there was no evidence for this.  Another possibility was that there was in fact some light in the human gut – leaking through from the outside or being produced from the inside. And perhaps this was enough to do a little photosynthesis.  Sound crazy?  Well, not so crazy after reports of photosynthesis in the deep sea.  A third possibility was that these sequences were coming from residents of the human gut that were related to (or even within) cyanobacteria but were not photosynthetic.  More detail on possible explanations is in this new paper and in some of the material cited therein.

Anyway – Ruth Ley has been discussing these unusual sequences for years and now in this paper her group and the group of Jill Banfield at Berkeley (along with some others) have used metagenomics and detailed assembly and phylogenetic analysis to reveal many new insights into these sequences.  I could write much more about this.  But, I think the paper really speaks for itself.  And it is open access so anyone and everyone can check it out.  And you should.  It is wonderful.

Fig 2 from Di Rienzi et al.

UPDATED 10/9/2013 to correct that there were co-first authors

Great use of metagenomic data: community wide adaptation signatures

OK I have been dreaming about doing something like this for many years.  One of the potentially most useful aspects of shotgun metagenomic data is that you get a sample of many/all members of a microbial community at once.  And then in theory one could look across different species and taxa and ask – do they all have similar adaptations in response to some sort of environmental pressure?  There have been a few papers on this over the last few years (e.g., check out this one from Muegge et al. on Diet Driving Convergence in Gut Microbes).  But this new paper is really the type of thing I have been hoping to see: Environmental shaping of codon usage and functional adaptation across microbial communities.  Basically they looked at codon usage in organisms in different metagenomic samples and found major metagenome specific signatures, suggesting that different taxa were in essence converging on common codon usage.
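To make the idea of a codon usage "signature" a bit more concrete, here is a minimal sketch of how one could tabulate codon frequencies for a set of coding sequences and compare two such profiles. This is only an illustration: the function names (`codon_usage`, `usage_distance`) and the toy sequences are made up here, the sequences are assumed to be in frame, and the paper itself may well use a different (for example, synonymous-codon-aware) measure.

```python
from collections import Counter


def codon_usage(coding_sequences):
    """Relative frequency of each codon across in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(profile_a, profile_b):
    """Half the summed absolute difference: 0 = identical, 1 = no shared codons."""
    codons = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) for c in codons) / 2


# Toy, made-up genes standing in for genes binned from two metagenomic samples.
sample_one = codon_usage(["ATGGCTGCT", "ATGGCTAAA"])
sample_two = codon_usage(["ATGGCCGCC", "ATGGCCAAG"])
print(usage_distance(sample_one, sample_two))
```

If different taxa within a sample really converge on common codon usage, one would expect distances between taxa from the same sample to be smaller than distances between samples.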

The paper is definitely worth a look.

Guest post from Kimmen Sjölander about FAT-CAT phylogenomics pipeline

Below is a guest post from my friend and colleague Kimmen Sjölander, Prof. at UC Berkeley and phylogenomics guru. 

Announcing the FAT-CAT phylogenomic annotation webserver.

FAT-CAT is a new web server for phylogenomic prediction of function and ortholog identification and for taxonomic origin prediction of metagenome sequences based on HMM-based classification of protein sequences to >93K pre-calculated phylogenetic trees in the PhyloFacts database. PhyloFacts is unique among phylogenomic databases in having both broad taxonomic coverage – more than 7.3M proteins from >99K unique taxa across the Tree of Life, including targeted coverage of genomes from Eukaryotes, Bacteria and Archaea – and integrating functional data on trees for Pfam domains and multi-domain architectures. PhyloFacts trees include functional and annotation data from UniProt (SwissProt and TrEMBL), GO, BioCyc, Pfam, Enzyme Commission and other sources. The FAT-CAT pipeline uses HMMs at all nodes in PhyloFacts trees to classify user sequences to different levels of functional hierarchies, based on the subtree HMM giving the sequence the strongest score. Phylogenetic placements within orthology groups defined on PhyloFacts trees are used to predict function and to predict orthologs. Sequences from metagenome projects can be classified taxonomically based on the MRCA of the sequences descending from the top-scoring subtree node. Because of the broad taxonomic and functional coverage, FAT-CAT can identify orthologs and predict function for most sequence inputs. We’re working to make FAT-CAT less computationally intensive so that users will be able to upload entire genomes for analysis; in the interim, we limit users to 20 sequence inputs per day. Registered users are given a higher quota (see details online). We’d love to hear from you if you have feature requests or bug reports; please send any to Kimmen Sjölander – kimmen at berkeley dot edu (parse appropriately). 
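As a rough illustration of the MRCA-based taxonomic assignment mentioned above, the sketch below treats each sequence under a subtree as a root-to-tip lineage and reports the deepest rank they all share. This is not the FAT-CAT implementation; the `mrca` function and the simplified lineages are assumptions made up for the example.

```python
def mrca(lineages):
    """Longest shared root-to-tip prefix of a collection of taxonomic lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):  # walk the lineages rank by rank from the root
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical sequences descending from a top-scoring subtree node.
subtree_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rickettsia"],
]
print(" > ".join(mrca(subtree_lineages)))
```

The broader the set of taxa under the subtree, the shallower (and more conservative) the taxonomic call becomes.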

Cool new paper from DeLong lab: Pattern and synchrony of gene expression among sympatric marine microbial populations

Definitely worth looking at this paper if you are interested in uncultured microbes: Pattern and synchrony of gene expression among sympatric marine microbial populations.  From Ed DeLong and team, it is published under the “Open” pathway in PNAS.

Also see press release here: Scientists track ocean microbe populations in their natural habitat to …

Interesting new #PLOS One paper on study design in rRNA surveys

Interesting new paper in PLoS One:  PLOS ONE: Taxonomic Classification of Bacterial 16S rRNA Genes Using Short Sequencing Reads: Evaluation of Effective Study Designs

Abstract: Massively parallel high throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughput sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies cause a predicament though, because although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughput studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ~8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short read sequencing data, and provide recommendations for optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.

Not sure I like everything in the paper.  For example, they focus on naive Bayesian classification methods … when (of course) I prefer phylogenetic methods.  But that is a small issue.  Overall there is a lot of useful detail in here about rRNA based taxonomic studies.  I note – some of this probably applies to metagenomic studies as well … perhaps this group will do a comparable analysis of metagenomics next?
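For readers who have not seen a naive Bayesian sequence classifier, here is a toy sketch of the general idea: count k-mers per taxon in a training set, then assign a read to the taxon under which its k-mers are most probable. Everything here (the function names, the 4-mer size, add-one smoothing, equal priors, and the made-up training sequences) is an assumption for illustration and is not the configuration evaluated in the paper.

```python
import math
from collections import Counter


def kmers(seq, k=4):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labelled_sequences, k=4):
    """Build per-taxon k-mer counts from (taxon, sequence) pairs."""
    model = {}
    for taxon, seq in labelled_sequences:
        model.setdefault(taxon, Counter()).update(kmers(seq, k))
    return model


def classify(model, read, k=4):
    """Return the taxon with the highest log-likelihood (equal priors assumed)."""
    vocabulary = 4 ** k  # number of possible DNA k-mers, used for add-one smoothing
    best_taxon, best_score = None, -math.inf
    for taxon, counts in model.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[kmer] + 1) / (total + vocabulary))
            for kmer in kmers(read, k)
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Made-up training sequences; real training sets are curated 16S rRNA genes.
model = train([
    ("taxon_A", "ACGTACGTACGTACGTACGT"),
    ("taxon_B", "GGCCTTAAGGCCTTAAGGCC"),
])
print(classify(model, "ACGTACGTAC"))
```

Note that nothing in this approach uses a tree, which is the contrast with the phylogenetic methods I mentioned preferring.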

Mizrahi-Man O, Davenport ER, Gilad Y (2013) Taxonomic Classification of Bacterial 16S rRNA Genes Using Short Sequencing Reads: Evaluation of Effective Study Designs. PLoS ONE 8(1): e53608. doi:10.1371/journal.pone.0053608

I note – if you want to catch up / learn / research metagenomics and phylogeny or classification check out the Mendeley group I started on the topic:

Attention all metagenomicists: put your pinky in the corner of your mouth & say "1 million dollars"

Already posted this to Twitter and Facebook but had to post here too.  This is wild.  DTRA has announced a $1 million prize for metagenomic analysis: DTRA Algorithm Challenge | Landing Page.  From their page

The Prize:
As nth generation DNA sequencing technology moves out of the research lab and closer to the diagnostician’s desktop, the process bottleneck will quickly become information processing. The Defense Threat Reduction Agency (DTRA) and the Department of Defense are interested in averting this logjam by fostering the development of new diagnostic algorithms capable of processing sequence data rapidly in a realistic, moderate-to-low resource setting. With this goal in mind, DTRA is sponsoring an algorithm development challenge. 

The Challenge:
Given raw sequence read data from a complex diagnostic sample, what algorithm can most rapidly and accurately characterize the sample, with the least computational overhead?

My instinct is to keep this to myself because, well, I want to win.  But my sharing side of things won out and I am posting here.  Maybe we (i.e., the community) can develop an open, collaborative project to do this?  Just a thought …

People not Projects: the Moore Foundation continues to revolutionize marine microbiology w/ its Investigator program

People not Projects.

It is such a simple concept.  But it is so powerful.  I first became aware of this idea as it relates to funding scientific research in regard to the Howard Hughes Medical Institute’s Investigator program.  Their approach (along with a decent chunk of money) has helped revolutionize biomedical science.  And thus I was personally thrilled to see the introduction of this concept in the area of Marine Microbiology a few years back with the Gordon and Betty Moore Foundation’s “Marine Microbiology Initiative Investigator” program.  Launched in 2004 it helped revolutionize marine microbiology studies in the same way HHMI’s investigator program revolutionized biomedical studies.

The first GBMF MMI Investigator program ran from 2004-2012. And the people supported were pretty darn special:

Now I am, I suppose, a little biased in this because at the same time GBMF launched this program they also put a bunch of money into the general area of Marine Microbiology and I have been the recipient of some of that money.  For example, I got a small amount of money as part of the GBMF-funded work at the J. Craig Venter Institute on the Sargasso Sea and Global Ocean Sampling metagenomic sequencing projects and also had a subcontract from UCSD/JCVI to do some work as part of the “CAMERA” metagenomic database project.  I ended up being a coauthor on a diverse collection of papers associated with these projects including Sargasso metagenome and this review, and GOS1, GOS2 and my stalking the 4th domain paper.

I am also a bit biased in that I have worked with many of the people on the initial MMI Investigator list, some before and some after the awards, including papers with Jen Martiny, Ed DeLong, Alex Worden, Ginger Armbrust, and Mary Ann Moran.

But perhaps most relevant in terms of possible bias towards the Gordon and Betty Moore Foundation is that in 2007 my lab received funds through the MMI program for a collaborative project with Jessica Green and Katie Pollard for our “iSEEM” project on “Integrating Statistical, Ecological and Evolutionary analyses of Metagenomic Data”, which was one of the most successful collaborations in which I have ever been involved.  This project produced something like a dozen papers and many major new developments in analyses of metagenomic data including 16S copy correction, sifting families, microbeDB, PD of metagenomes, WATERs, BioTorrents, AMPHORA, and STAP.  This project just ended but Katie Pollard and I just got additional funds from GBMF to continue related work.

So sure – I am biased.  But the program is simply great.  In the eight years since the initial grants the Gordon and Betty Moore Foundation has helped revolutionize marine microbiology.  And a lot of this came from the Investigator program and its emphasis on people not projects.  I note – the Moore Foundation has clearly decided that this “people not projects” concept is a good one.  A few years ago they partnered with HHMI to launch a Plant Sciences Investigator Program  which I wrote about here.

It was thus with great excitement that I saw the call for applications for the second round of the MMI Investigator program.  I certainly pondered applying.  But for many reasons I decided not to.  And today the winners of this competition have been announced and, well, it is a very impressive crew:

Some of the same crowd as the previous round.  Some new people.  Some people not there from the previous round.  All of them are rock stars in their areas especially if one takes into account how senior they are (the more junior people are stars in development).  And all have done groundbreaking work in various areas relating to marine microbiology.  The organisms covered here run the gamut including viruses, bacteria, archaea, and microbial eukaryotes.  The areas of focus covered range from biogeochemistry to ecosystem modeling with everything in between.  It really is an impressive group. DeLong pioneered metagenomics and helped launch studies of uncultured microbes in the oceans.  Karl has led the Hawaii Ocean Time series and done other brilliant work.  Sullivan and Rohwer are pushing the frontiers of viral studies in the oceans.  Allen, Armbrust, and Worden are among the leaders in genomic studies of microbial eukaryotes in the marine environment.   Dubilier, Bidle, Fuhrman and Stocker (double listed Follows in original post …) – though they focus on very different aspects of marine microbes – are helping lead the charge in understanding interactions across the domains of life in the marine environment.  Orphan, Saito, Deutsch, Follows and Pearson are on the cutting edge of biogeochemical studies and trying to link experimental studies of microbes to biogeochemistry of oceans.

The great thing about the “people not projects” concept is that the people funded here get to follow their own path.  They are not going to be constrained by the complications and sometimes idiocy of the grant review process.  They in essence get to do whatever they want.  Freedom to follow their noses.  Or their guts.  Or whatever.  It is a refreshing concept and as mentioned above has been revolutionary in various areas of science.  There has been a slow but steady spread of the “people not projects” concept to various federal agencies too but it seems to be more of a private foundation type of strategy.  Federal Agencies are so risk averse in funding that this type of concept does not work well there.  I wish there was more.  But I am at least thankful for what HHMI and GBMF and Wellcome and Sloan and other private groups are doing in this regard.  Now – sure – all of these private foundations do not do everything perfectly.  They have blunders here and there like everyone else.  But without a doubt I think we need more of the People not Projects concept.
Oh – and another good thing.  GBMF is quite a big supporter of Open Science in its various guises.  So one can expect much of the data, software, and papers from their funding to be widely and openly available.   
It is a grand time to be doing microbiology largely due to revolutions in technology and also to changes in the way we view microbes on the planet.  It is an even grander time to be doing marine microbiology due to the dedication of the Gordon and Betty Moore Foundation to this important topic.  

Story behind the Paper: Functional biogeography of ocean microbes

Guest Post by Russell Neches, a PhD Student in my lab and Co-Author on a new paper in PLoS One.  Some minor edits by me.

For this installment of the Story Behind the Paper, I’m going to discuss a paper we recently published in which we investigated the geographic distribution of protein function among the world’s oceans. The paper, Functional Biogeography of Ocean Microbes Revealed through Non-Negative Matrix Factorization, came out in PLOS ONE in September, and was a collaboration among Xingpeng Jiang (McMaster, now at Drexel), Morgan Langille (UC Davis, now at Dalhousie), myself (UC Davis), Marie Elliot (McMaster), Simon Levin (Princeton), Jonathan Eisen (my adviser, UC Davis), Joshua Weitz (Georgia Tech) and Jonathan Dushoff (McMaster).

Using projections to “see” patterns in complex biological data

Biology is notorious for its exuberant abundance of factors, and one of its central challenges is to discover which among a large group of factors are important for a given question. For this reason, biologists spend a lot of time looking at tables that might resemble this one :

            sample A    sample B    sample C
factor 1
factor 2
factor 3

Which factors are important? Which differences among samples are important? There are a variety of mathematical tools that can help distill these tables into something perhaps more tractable to interpretation. One way or another, all of these tools work by decomposing the data into vectors and projecting them into a lower dimensional space, much the way an object casts a shadow onto a surface. 

The idea is to find a projection that highlights an important feature of the original data. For example, the projection of the fire hydrant onto the pavement highlights its bilateral symmetry.

So, projections are very useful. Many people have a favorite projection, and like to apply the same one to every bunch of data they encounter. This is better than just staring at the raw data, but different data and different effects lend themselves to different projections. It would be better if people generalized their thinking a little bit.

When you make a projection, you really have three choices. First, you have to choose how the data fits into the original space. There is more than one valid way of thinking about this. You could think about it as arranging the elements into vectors, or deciding what “reshuffling” operations are allowed. Then, you have to choose what kind of projection you want to make. Usually people stick with some flavor of linear transformation. Last, you have to choose the space you want to make your projection into. What dimensions should it have? What relationship should it have with the original space? How should it be oriented?

In the photograph of the fire hydrant, the original data (the fire hydrant) is embedded in a three dimensional space, and projected onto the ground (the lower dimensional space) by the sunlight by casting a shadow. The ground happens to be roughly orthogonal to the fire hydrant, and the sunlight happens to fall from a certain angle. But perhaps this is not the ideal projection. Maybe we’d get a more informative projection if we put a vertical screen behind the fire hydrant, and used a floodlight? Then we’d be doing the same transformation on the same representation of the data, but into a space with a different orientation. Suppose we could make the fire hydrant semi-transparent, we placed it inside a tube-shaped screen, and illuminated the fire hydrant from within? Then we’d be using a different representation of the original data, and we’d be doing a non-linear projection into an entirely different space with a different relationship with the original space. Cool, huh?
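The shadow analogy can be written down in a few lines. The sketch below is only an illustration with made-up coordinates: `cast_shadow` is a hypothetical helper that projects a 3D point along a light direction onto a flat screen passing through the origin, and the "hydrant" is just four points. It covers the linear cases (sunlight on the ground, a floodlight on a vertical screen), not the tube-shaped screen.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cast_shadow(point, light_direction, screen_normal):
    """Slide a point along the light direction until it hits the screen (a plane through the origin)."""
    denom = dot(screen_normal, light_direction)
    if denom == 0:
        raise ValueError("light is parallel to the screen, so no shadow is cast")
    t = dot(screen_normal, point) / denom
    return tuple(p - t * d for p, d in zip(point, light_direction))


# A made-up "hydrant": a base, two side caps, and a top, as (x, y, z) points.
hydrant = [(0.0, 1.0, 0.0), (0.5, 1.0, 1.0), (-0.5, 1.0, 1.0), (0.0, 1.0, 2.0)]

ground = (0.0, 0.0, 1.0)        # normal of the pavement (the plane z = 0)
wall = (0.0, 1.0, 0.0)          # normal of a vertical screen (the plane y = 0)
overhead_sun = (0.0, 0.0, -1.0)
floodlight = (0.0, -1.0, 0.0)

on_ground = [cast_shadow(p, overhead_sun, ground) for p in hydrant]
on_wall = [cast_shadow(p, floodlight, wall) for p in hydrant]
print(on_ground)  # heights are lost, only the footprint survives
print(on_wall)    # heights survive, depth is lost
```

Same data, same kind of transformation, different target space: the ground shadow throws away the heights while the wall shadow keeps them, which is the point about choosing the space you project into.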

It’s important to think generally when choosing a projection. When you start trying to tease some meaning out of a big data set, the choice of principal component analysis, or k-means clustering, or canonical correlation analysis, or support vector machines, has important implications for what you will (or won’t) be able to see.

How we started this collaboration: a DARPA project named FunBio

Between 2005 and 2011, DARPA had a program humbly named The Fundamental Laws of Biology (FunBio). The idea was to foster collaborations among mathematicians, experimental biologists, physicists, and theoretical biologists — many of whom already bridged the gap between modeling and experiment. Simon Levin was the PI of the project and Benjamin Mann was the program officer. The group was large enough to have a number of subgroups that included theorists and empiricists, including a group focused on ecology. Jonathan Eisen was the empiricist for microbial ecology, and was very interested in the binning problem for metagenomics; that is, classifying reads, usually by taxonomy. Conversations in and out of the program facilitated the parallel development of two methods in this area: LikelyBin (led by Andrey Kislyuk and Joshua Weitz with contributions from Srijak Bhatnagar and Jonathan Dushoff) and CompostBin (led by Sourav Chatterji and Jonathan Eisen with contributions from collaborators). At this stage, the focus was more on methods than biological discoveries.

The binning problem highlights some fascinating computational and biological questions, but as the program developed, the group began to tilt in direction of biological problems. For example, Simon Levin was interested in the question: Could we identify certain parts of the ocean that are enriched for markers of social behavior?

One of the key figures in any field guide is an ecosystem map. These maps are the starting point from which a researcher can orient themselves when studying an ecosystem by placing their observations in context. 

Handbook of Birds of the Western United States, Florence Merriam Bailey 
There are a variety of approaches one could take that entail deep questions about how best to describe variation in taxonomy and function. For example, we could try to find “canonical” examples of each ecosystem, and then perhaps identify intermediates between them. Similarly, we could try and find “canonical” examples of the way different functions are distributed across ecosystems and identify intermediates between them.

In the discussions that followed, we discussed how to describe the functional and taxonomic diversity in a community as revealed via metagenomics; that is, how do we describe, identify and categorize ecosystems and associated function? In order to answer this question, we had to confront a difficult issue: how to quantify and analyze metagenomic profile data.

Metagenomic profile data: making sense of complexity at the microbe scale

Metagenomics is perhaps the most pervasive cause of the proliferation of giant tables of data that now beset biology. These tables may represent the proportion of taxa at different sites, e.g., as measured across a transect using operational taxonomic units as proxies for distinct taxa. Among these giant tables, one of the challenges that has been brought to light is that there can be a great deal of gene content variability among individuals of a single taxon. As a consequence, obtaining the taxonomic identities of organisms in an ecosystem is not sufficient to characterize the biological functions present in that community. Furthermore, ecologists have long known that there are often many organisms that could potentially occupy a particular ecological niche. Thus, using taxonomy as a proxy for function can lead to trouble in two different ways: first, the organism you’ve found might be doing something very different from what it usually does, and second, the absence of an organism that usually performs a particular function does not necessarily imply the absence of that function. So, it’s rather important to look directly at the genes in the ecosystem, rather than taking proxies. You can see where this is going, I’m sure: Metagenomics, the cure for the problems raised by metagenomics!

When investigating these ecological problems, it is easy to take for granted the ability to distinguish one type of environment from another. After all, if you were to wander from a desert to a jungle, or from forest to tundra, you can tell just by looking around what kind of ecosystem you are in (at least approximately). Or, if the ecosystems themselves are new to you, it should at least be possible to notice when one has stepped from one into another. However, there is a strong anthropic bias operating here, because not all ecosystems are visible on human scales. So, how do you distinguish one ecosystem from another if you can’t see either?

One way is to look at the taxa present, but that works best if you are already somewhat familiar with that ecosystem. Another way is to look at the general properties of the ecosystem. With microbial ecosystems, we can look at predicted gene functions. Once again, this line of reasoning points to metagenomics.

We wanted to use a projection method that avoids drawing hard boundaries, reasoning that hard boundaries can lead to misleading results due to over-specification. Moreover, in doing so Jonathan Dushoff advocated for a method that had the benefits of “positivity”, i.e., the projection would be done in a space where the components and their weights were positive, consistent with the data, and which would help the interpretability of our results. This is the central reason why we wanted to use an alternative to PCA. The method Jonathan Dushoff suggested was Non-negative Matrix Factorization (NMF). This choice led to a number of debates and discussions, in part because NMF is not a “standard” method (yet), though it has seen increasing use within computational biology. It is worth talking about these issues to help contextualize the results we did find.

The Non-negative Matrix Factorization (NMF) approach to projection

The conceptual idea underlying NMF (and a few other dimensional reduction methods) is a projection that allows entities to exist in multiple categories. This turns out to be quite important for handling corner cases. If you’ve ever tried to build a music library, you’ve probably encountered this problem. For example, you’ve probably created categories like Jazz, Blues, Rock, Classical and Hip Hop. Inevitably, you find artists who don’t fit into the scheme. Does Porgy and Bess go into Jazz or Opera? Does the soundtrack for Rent go under Musicals or Rock? What the heck is Phantom of the Opera anyway? If your music library is organized around a hierarchy of folders, this can be a real headache, and results either in sacrificing information by arbitrarily choosing one legitimate classification over another, or in creating artistically meaningless hybrid categories.

This problem can be avoided by relaxing the requirement that each item must belong to exactly one category. For music libraries, this is usually accomplished by representing categories as attribute tags, and allowing items to have more than one tag. Thus, Porgy and Bess can carry the tags Opera, Jazz and Soundtrack. This is more informative and less brittle.

NMF accomplishes this by decomposing large matrices into smaller matrices with non-negative components. These decompositions often do a better job at clustering data than eigenvector based methods for the same reason that tags often work better for organizing music than folders. In ecology, the metabolic profile of a site could be represented as a linear combination of site profiles, and the site profile of a taxonomic group could be represented as a linear combination of taxonomic profiles. When we’ve tried this, we have found that although many sites, taxa and Pfams have profiles close to these “canonical” profiles, many are obviously intermediate combinations. That is to say, they have characteristics that belong to more than one classification, just as Porgy and Bess can be placed in both Jazz and Opera categories with high confidence. Because the loading coefficients within NMF are non-negative (and often sparse), they are easy to interpret as representing the relative contributions of profiles.

What makes NMF really different from other dimensional reduction methods is that these category “tags” are positive assignments only. Eigenvector methods tend to give both positive and negative assignments to categories. This would be like annotating Anarchy in the U.K. by the Sex Pistols with the “Classical” tag and a score of negative one, because Anarchy in the U.K. does not sound very much like Frédéric Chopin’s Tristesse or Franz Liszt’s Piano Sonata in B minor. While this could be a perfectly reasonable classification, it is conceptually very difficult to wrap one’s mind around concepts like non-Punk, anti-Jazz or un-Hip-Hop. From an epistemological point of view, it is preferable to define things by what they are, rather than by what they are not.
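For the curious, here is a small, self-contained sketch of what an NMF looks like in code. It uses the generic multiplicative-update scheme (in the style of Lee and Seung) on a made-up 5 x 4 table; it is not the procedure or the data from the paper, and the names (`nmf`, `frobenius_error`, `toy_table`) are ours for illustration. The table is approximated as W times H, where every entry of W and H stays non-negative.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Root of the summed squared differences between V and W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor a non-negative m x n table V into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # Multiplying by a ratio of non-negative terms can never create a negative entry.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)] for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(m)]
    return w, h


# Made-up table: rows are items, columns are sites. The last row mixes both patterns.
toy_table = [
    [5.0, 5.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 6.0, 6.0],
    [2.0, 2.0, 2.0, 2.0],
]
w, h = nmf(toy_table, k=2)
for row in w:
    print([round(x, 2) for x in row])  # the "tag" weights of each item
```

The last row of the toy table is the Porgy and Bess of this example: it should end up with weight on both components rather than being forced into one, and none of the weights can be negative.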

To give you an idea of what this looks like when applied to ecological data, it is illustrative to see how the Pfams we found in the Global Ocean Survey cluster with one another using NMF, PCA and direct similarity:

While PCA seems to over-constrain the problem and direct similarity seems to under-constrain the problem, NMF clustering results in five or six clearly identifiable clusters. We also found that within each of these clusters one type of annotated function tended to dominate, allowing us to infer broad categories for each cluster: Signalling, Photosystem, Phage, and two clusters of proteins with distinct but unknown functions. Finally, in practice, PCA is often combined with k-means clustering as a means to classify each site and function into a single category. Likewise, NMF can be used with downstream filters to interpret the projection in a “hard” or “exclusive” manner. We wanted to avoid these types of approaches.

Indeed, some of us had already had some success using NMF to find a lower-dimensional representation of these high-dimensional matrices. In 2011, Xingpeng Jiang, Joshua Weitz and Jonathan Dushoff published a paper in JMB describing an NMF-based framework for analyzing metagenomic read matrices. In particular, they introduced a robust method for choosing a factorization degree that can disentangle overlapping contributions in metagenomic data sets, and applied spectral-reordering techniques to NMF-based similarity matrices to aid in visualization.

While NMF has clear advantages, it also comes with caveats. For example, the projection is non-unique and the dimensionality of the projection must be chosen carefully. To find out how we addressed these issues, read on!
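One simple source of non-uniqueness can be shown in a few lines. In the sketch below (the matrices are made up for illustration), rescaling the components by any positive factors gives a different pair of non-negative matrices with exactly the same product. This is only the most basic kind of ambiguity, and it says nothing about how the paper deals with it.

```python
import numpy as np

# Hypothetical non-negative factors.
W = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
H = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Rescale component 1 by 2 and component 2 by 0.5, and undo it in H.
D = np.diag([2.0, 0.5])
W2 = W @ D
H2 = np.linalg.inv(D) @ H

same_product = np.allclose(W @ H, W2 @ H2)
```

Both (W, H) and (W2, H2) are valid non-negative factorizations of the same matrix, which is why the components are usually normalized in some fixed way before they are interpreted.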

Using NMF as a tool to project and understand metagenomic functional profile data

We analyzed the relative abundance of microbial functions as observed in metagenomic data taken from the Global Ocean Survey (GOS) dataset. The choice of GOS was motivated by our interest in ocean ecosystems and by the relative richness of metadata and information on the GOS sites that could be leveraged in the course of our analysis. In order to analyze microbial function, we restricted ourselves to the analysis of reads that could be mapped to Pfams. Hence, the matrices have columns which denote sampling sites, and rows which denote distinct Pfams. The value in each cell denotes the relative number of Pfam matches at that site, normalized so that the values in each column sum to 1. In total, we ended up mapping more than six million reads into an 8214 x 45 matrix.
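As a small illustration of the normalization described above, the sketch below converts a made-up table of raw match counts (Pfams in rows, sites in columns) into relative abundances. The counts are hypothetical and far smaller than the real 8214 x 45 matrix.

```python
import numpy as np

# Hypothetical raw counts: 3 Pfams (rows) x 2 sites (columns).
counts = np.array([[10.0, 0.0],
                   [30.0, 5.0],
                   [60.0, 15.0]])

# Divide each column by its total so that every site's profile sums to 1.
column_totals = counts.sum(axis=0, keepdims=True)
profile = counts / column_totals
```

After this step the sites can be compared directly even if they were sequenced to very different depths, since each column now describes proportions rather than raw read counts.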

We then utilized NMF tools for analyzing metagenomic profile matrices, and developed new methods (such as a novel approach to determining the optimal rank), in order to decompose our very large 8214 x 45 profile matrix into a set of 5 components. This projection is the key to our analysis, in that it highlights most of the meaningful variation and provides a means to quantify that variation. We spent a lot of time talking among ourselves, and then later with our editors and reviewers, about the best way to explain how this method works. Here is our best effort, from the Results section, at explaining what these components represent:

Each component is associated with a “functional profile” describing the average relative abundance of each Pfam in the component, and with a “site profile”, describing how strongly the component is represented at each site. 

A component has both a column vector representing how much each Pfam contributes to the component and a row vector representing the importance of that component at different sites. Each Pfam may be associated with more than one component. Likewise, each component can have a different strength at each site. Remember the music library analogy? This is how NMF achieves the effect of category “tags” which can label overlapping sets of items, rather than “folders” which must contain mutually exclusive sets.
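To make the column-vector and row-vector picture concrete, here is a small sketch with made-up factors (4 Pfams, 2 components, 3 sites). The names W, H and membership are our own choices for the example, not notation from the paper.

```python
import numpy as np

# Hypothetical factors: W is Pfams x components, H is components x sites.
W = np.array([[0.8, 0.0],
              [0.6, 0.1],
              [0.3, 0.3],   # a Pfam shared equally by both components
              [0.0, 0.9]])
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])

# Each component is one layer: its functional profile (a column of W)
# times its site profile (a row of H).
layers = [np.outer(W[:, k], H[k, :]) for k in range(W.shape[1])]
V_approx = sum(layers)  # identical to W @ H

# Soft "tags": the share of each Pfam's loading carried by each component.
membership = W / W.sum(axis=1, keepdims=True)
```

The third Pfam ends up with a membership of one half in each component. That is the "tag" behaviour, whereas a "folder" scheme would have to force it into one component or the other.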

Such a projection does not exclusively cluster sites and functions together. We discovered five functional types, but we are not claiming that any of these five functional types are exclusive to any particular set of sites. This is a key distinction from concepts like enterotypes.

What we did find is that, of these five components, three had an enrichment for Pfams whose ontology was often identified with signalling, phage, and photosystem function, respectively. Moreover, these components tended to be found in different locations, but not exclusively so. Hence, our results suggest that each sampling location had a suite of functions that often co-occurred there.

We also found that many Pfams with unknown functions (DUFs, in Pfam parlance) clustered strongly with well-annotated Pfams. These are tantalizing clues that could perhaps lead to discovery of the function of large numbers of currently unknown proteins. Furthermore, it occurred to us that a larger data set with richer metadata might perhaps indicate the function of proteins belonging to clusters dominated by DUFs. Unfortunately, we did not have time to fully pursue this line of investigation, and so, with a wistful sigh, we kept in the basic idea and left the rest as an opportunity for the future. We also did a number of other analyses, including analyzing the variation in function with respect to potential drivers, such as geographic distance and environmental “distance”. This is all described in the PLoS ONE paper.

So, without re-keying the whole paper, we hope this story-behind-the-story gives a broader view of our intentions and background in developing this project. The reality is that we still don’t know the mechanisms by which components might emerge, and we would still like to know where this component view of ecosystem function will lead. Nevertheless, we hope that alternatives to exclusive clustering will be useful in future efforts to understand complex microbial communities.

Full Citation: Jiang X, Langille MGI, Neches RY, Elliot M, Levin SA, et al. (2012) Functional Biogeography of Ocean Microbes Revealed through Non-Negative Matrix Factorization. PLoS ONE 7(9): e43866. doi:10.1371/journal.pone.0043866.