OK. Got a question for the blogo-twitto-webosphere.
In this day and age of rapid shotgun sequencing of genomes, many people are moving away from finishing the genomes. As some may know, I was a co-author many years ago on a paper arguing for the need to finish (rather than just shotgun sequence) microbial genomes for many scientific questions. But as sequencing costs continued to plummet, the relative cost of shotgun sequencing genomes kept going down while the cost of finishing genomes did not change much. So, two years ago I posted to my blog a question regarding this: “Wanted: Feedback on Importance of Finishing (Microbial) Genomes” and got a lot of useful feedback. And eventually, I came around to the argument that finishing was unnecessary for many purposes (and even got some props for admitting I was wrong).
Well, I am back. Now I am arguing to some colleagues that if we want to make metabolic pathway predictions and metabolic models for genomes, we probably don’t need finished genomes. But alas, I have no evidence to back that up. And in fact, I am not really sure anyway. So I am asking everyone and anyone out there … does anyone have data/evidence/opinions about whether there will be much difference in metabolic predictions one would make for an organism based upon a complete genome vs. a shotgun assembly of a genome generated by Illumina sequencing?
Storification of some comments:
http://storify.com/phylogenomics/completeness-of-microbial-genomes-for-metabolic-pr.js[View the story “Completeness of microbial genomes for metabolic predictions” on Storify]
12 thoughts on “How complete do microbial genomes have to be for metabolic predictions (to be useful)?”
From a practical personal perspective, with bacterial draft genomes (though I flatter myself that they're *good* drafts 😉 – mostly 454 rather than Illumina, which needed a bit of help from 454 in a hybrid asm) we can get around 1300-1400 metabolic reactions from 1100-1200 annotated genes. This is at high 90% coverage of total expected genes from gold-standard Sanger sequenced close relatives.
I wonder how much farther on we are than the situation reported in Chen L, Vitkup D. Trends Biotechnol. 2007 Aug;25(8):343-8. “Distribution of orphan metabolic activities” – (http://www.ncbi.nlm.nih.gov/pubmed/17580095). So long as 30-40% of metabolism remains unassociated with gene function, I think the difference between closed and draft genome coverage of metabolism is likely to be marginal, and not the larger part of the problem.
I generally agree – given that many predictions of individual protein function are inaccurate and given that for many proteins we cannot predict ANY function and given that many genes are not involved in metabolism anyway, I just don't see how even 5% of the genome would make much of a difference in accuracy of overall metabolic predictions.
Note – I sent a related question to Patrik D'haeseleer who has worked on metabolic modeling from genomes.
“know anything about how complete genomes have to be for good metabolic modeling?”
And he wrote
“Hm… I'm not sure anyone has formally quantified that yet. It would also depend a lot on what type of metabolic modeling we're talking about. We've been doing metabolic reconstruction using Pathway Tools on phylogenetic bins from metagenomes, where we typically have 80-90% of a genome in a few hundred contigs.
Pathway Tools uses some fairly lenient heuristics to decide whether a pathway is present, based on the enzyme annotations it finds in the (meta)genome, so you definitely don't have to have *all* the enzymes in a pathway. The more qualitative analysis of pathway content that Pathway Tools produces can be useful to assign broad metabolic roles, such as “organism A has more aromatic degradation pathways, whereas organism B has more oligosaccharide degradation”.
Missing enzymes can be a much bigger issue when you're trying to do flux balance modeling. Heck, flux balance analysis can be hard enough to do when you have a complete genome. But if you don't have an isolate you may not be interested in flux balance analysis anyway.
Given that 30-40% of genes in complete genomes don't have any functional annotations, and that a good number of those are presumably unannotated enzymes, you never really know all the enzymes in a genome anyway. As long as the amount of missing sequence is negligible compared to the amount of misannotation and missing annotation, you should be able to at least get a fairly accurate overview of the metabolism of the organism. I probably wouldn't bother trying to do a metabolic reconstruction with less than 70% of the genome though.
One interesting application is in using single cell sequencing to predict cultivation conditions of unculturable organisms. In that case, you're not really interested in getting the whole-genome metabolic reconstruction anyway, but you're looking for key biosynthesis or degradation pathways. Since the enzymes in a pathway tend to cluster on the genome, you may be able to find entire pathways even with very partial genome sequence, which may allow you to rule out the need to supplement the medium with an essential amino acid, or may suggest unusual nutrients the organism may be able to grow on. Conversely, because enzymes tend to cluster, you may not be able to confirm that a specific pathway is missing until you have almost all of the genome sequence.
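To make the "lenient heuristics" idea above concrete, here is a minimal Python sketch of a threshold-based pathway call. It is not the actual Pathway Tools algorithm, which is more involved. The function name `call_pathways`, the 50% cutoff, the pathway names and the reaction IDs are all made up for illustration.

```python
def call_pathways(pathways, annotated_reactions, min_fraction=0.5):
    """Call a pathway present if enough of its reactions have an annotated enzyme.

    pathways: dict mapping pathway name -> list of reaction IDs (hypothetical)
    annotated_reactions: set of reaction IDs with at least one annotated enzyme
    min_fraction: assumed cutoff, not a value taken from Pathway Tools
    """
    calls = {}
    for name, reactions in pathways.items():
        found = sum(1 for r in reactions if r in annotated_reactions)
        fraction = found / len(reactions)
        calls[name] = (fraction >= min_fraction, fraction)
    return calls


# Toy input: reaction IDs are placeholders, not real database identifiers.
pathways = {
    "aromatic_degradation": ["rxn1", "rxn2", "rxn3", "rxn4"],
    "oligosaccharide_degradation": ["rxn5", "rxn6", "rxn7"],
}
annotated = {"rxn1", "rxn2", "rxn4", "rxn5"}

calls = call_pathways(pathways, annotated)
for name, (present, fraction) in calls.items():
    print(f"{name}: present={present}, fraction found={fraction:.2f}")
```

With a cutoff like this, a pathway missing one enzyme out of four is still called present, which is why a draft covering most of the genome can give much the same qualitative picture. A flux balance model has no such slack, since a single missing reaction can block a route.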
I haven't closed the last dozen or so prok genomes I've been involved in, and I've come to the realization that few if any genomes in the future will be closed, but there are certainly things that are lost by not doing so.
Meaningful studies of genomic structure can't really be done on drafts because typically the pseudomolecules created from drafts are created on the basis of a previous closed genome rather than testing the hypothesis that the structure is preserved.
I'm interested in the evolution of metabolic pathways. I just read a paper on “The Emergence and Early Evolution of Biological Carbon-Fixation” [Braakman & Smith. 2012] and a review on the same subject [“Beyond the Calvin Cycle: Autotrophic Carbon Fixation in the Ocean,” Hügler & Sievert, 2011].
Both papers rely on genome sequences to decide whether certain pathways exist in certain species. Often those decisions depend on the presence or absence of only one or two enzymes. Obviously, if the gene is present then there's no problem. But if the gene is apparently missing from an incomplete sequence, the conclusions have to be more tentative.
That's an argument in favor of finishing sequences but you can compensate to some extent by comparing partial genomic sequences of a large number of related species. All in all, I'd say the advantage of having one thousand “unfinished” genomes is greater than having ten “finished” sequences.
I'm with you, Larry: finishing is good, but we need to figure out a way to do it really cheaply for it to be worth doing these days, when the cost is equivalent to doing hundreds to thousands more shotgun genomes.
De Bruijn graph assemblers split contigs where multiple paths in the graph share one or more k-mers.
Sure, this happens in repeat regions. But it also happens in gene families, whose members often share k-mers. Thus, genes in gene families are broken across multiple contigs. Scaffolding doesn't help, as N's are placed in the gaps.
So metabolic modelling on incomplete genomes underrepresents genes in gene families.
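A toy Python sketch of the point above: two made-up "paralogs" that share a conserved core also share k-mers, and those shared k-mers sit between branch points in a De Bruijn graph built from both. The sequences, the choice of k = 11 and the function names are illustrative assumptions. Real assemblers also deal with reverse complements, coverage and sequencing errors, all of which are ignored here.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def branching_kmers(seqs, k):
    """Return k-mers with more than one successor or predecessor.

    Nodes are k-mers; an edge joins consecutive k-mers within a sequence.
    An assembler would typically end a contig at such a branch.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k):
            a, b = s[i:i + k], s[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    nodes = set(succ) | set(pred)
    return {n for n in nodes if len(succ[n]) > 1 or len(pred[n]) > 1}


# Invented sequences: a shared core flanked by gene-specific sequence.
core = "ATGGCTGGTAAAGGT"
gene_a = "CCGTTAACGA" + core + "TTCAGGACTA"
gene_b = "GGATCCTTAG" + core + "AACGTGCTAC"
k = 11

shared = kmers(gene_a, k) & kmers(gene_b, k)
branches = branching_kmers([gene_a, gene_b], k)

print("shared k-mers:", len(shared))
print("branch points:", sorted(branches))
```

The first and last k-mers of the shared core come out as branch points, so both genes would be cut at the edges of the conserved region.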
Quick finishing == Illumina plus PacBio
In future == nanopore 100kb reads
We review problems of prok genome annotation here http://m.bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs007.short?rss=1
It seems to me that metabolic modeling always contains a great deal of assumptions or simplifications in order to 'model' what we can. Are there any metabolic models that can take carbon and nitrogen sources and have all the reactions necessary to make the organism? There is often just an output called 'Biomass' or growth. But people make models in order to make predictions. I think the goal should be to have as many described reactions as possible. More reactions, more predictive space. But having just the annotations you need for carbon catabolism can also be useful. You work with what you have and grow from that point. Lack of full annotation should not be a reason to not attempt a metabolic model.
This gives you an idea of what fraction of metabolic enzymes are currently missing from standard microbial genome annotations:
The effect on the flux models appears to be fairly small, but significant: “We augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction.”
Nice- very very useful paper.
Interesting point. Of course, if those genes have the same enzymatic function, then you really only need one to be correctly annotated, to get the correct pathway reconstruction.
So at least some types of analyses will be less sensitive to breaks in gene families…
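A small Python sketch of why function-level analyses can be less sensitive: count recovery per enzymatic function rather than per gene. The gene names, function labels and the `recovered_functions` helper are invented for illustration.

```python
def recovered_functions(gene_to_function, recovered_genes):
    """Return the functions with at least one recovered, annotated gene."""
    return {gene_to_function[g] for g in recovered_genes if g in gene_to_function}


# Hypothetical genome: a three-member gene family sharing one function,
# plus a single-copy gene with a different function.
gene_to_function = {
    "famA_1": "function_A",
    "famA_2": "function_A",
    "famA_3": "function_A",
    "geneB": "function_B",
}

# Suppose the draft assembly recovers only one family member intact.
recovered = {"famA_2"}

functions = recovered_functions(gene_to_function, recovered)
gene_recovery = len(recovered) / len(gene_to_function)
function_recovery = len(functions) / len(set(gene_to_function.values()))

print(f"genes recovered: {gene_recovery:.2f}")
print(f"functions recovered: {function_recovery:.2f}")
```

Here only a quarter of the genes survive, but function_A is still represented, so a presence/absence pathway call for it would not change. Analyses that depend on copy number or on telling paralogs apart would still be affected.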