Interesting new metagenomics paper w/ one big big big caveat – critical software not available

Very very strange. There is an interesting new metagenomics paper that has come out in Science this week. It is titled “Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota” and it is from the Armbrust lab at U. Washington.

One of the main points of this paper is that the lab has developed software that apparently can help assemble the complete genomes of organisms that are present in low abundance in a metagenomic sample. At some point I will comment on the science in the paper (which seems very interesting), though as the paper is not Open Access I feel uncomfortable doing so, since many of the readers of this blog will not be able to read it.

But something else relating to this paper is worth noting and it is disturbing to me. In a Nature News story on the paper by Virginia Gewin there is some detail about the computational method used in the paper:

“He developed a computational method to break the stitched metagenome into chunks that could be separated into different types of organisms. He was then able to assemble the complete genome of Euryarchaeota, even though it was rare within the sample. He plans to release the software over the next six months.”

What? It is imperative that software that is so critical to a publication be released in association with the paper. It is really unacceptable for the authors to say “we developed a novel computational method” and then to say “we will make it available in six months”. I am hoping the authors change their mind on this but I find it disturbing that Science would allow publication of a paper highlighting a new method and then not have the method be available. If the methods and results in a paper are not usable how can one test/reproduce the work?

Author: Jonathan Eisen

I am an evolutionary biologist and a Professor at U. C. Davis (see my lab site here). My research focuses on the origin of novelty (how new processes and functions originate). To study this I focus on sequencing and analyzing genomes of organisms, especially microbes, and using phylogenomic analysis.

23 thoughts on “Interesting new metagenomics paper w/ one big big big caveat – critical software not available”

Nick Loman says:

February 3, 2012 at 3:29 pm

I already posted this on Twitter, but I really think that peer reviewers have a responsibility to insist that software developed for a manuscript is both available and open-source before publication. Ideally this would be in some trusted location like Github, Sourceforge or Google Code. This also means reviewers can access it without giving away their identity (if this is an issue for them; I don't usually care and have taken to signing my reviews).

Titus Brown says:

February 3, 2012 at 3:49 pm

Our software implementation to do a similar thing (we don't split the graph heuristically) is, in fact, on github. And hey, look, the submitted paper is available, too! http://arxiv.org/abs/1112.4193. It's still in review, though.

Titus Brown says:

February 3, 2012 at 3:56 pm

And Jonathan, I'm happy to comment on the science for you, since we've been pursuing this approach for about 2 years, although I would need to run some tests on their data set first. From skimming, the only real weakness is that they run an assembly first, and then partition the assembled data. Since many assemblers perform poorly on raw metagenomic data, this is unlikely to be as comprehensive as it could be. Also note that similar-in-style (although more heuristic) approaches were used in the rumen paper (Hess et al.) and the Arctic permafrost paper (Mackelprang, 2011). Good stuff, all in all.

Titus Brown says:

February 3, 2012 at 3:58 pm

Last comment: see http://armbrustlab.ocean.washington.edu/seastar. Right now it says “will be updated week of Feb 6th.”

Julie says:

February 3, 2012 at 4:00 pm

Is this really a “rare” bug given it made up 7.5% of the sample? I would also note, from that news story, that many Euryarchaeota have been cultured and sequenced, just not this one! “One of those genomes came from the Euryarchaeota, a widespread group of marine microorganisms, none of which have been grown in culture or sequenced.”

Jonathan Eisen says:

February 3, 2012 at 4:00 pm

Yes – it says “information will be updated” – it does not say software will be made available

Jonathan Eisen says:

February 3, 2012 at 4:07 pm

I note – I have written to the software developer to encourage him to make it available ASAP …

Guy Leonard says:

February 3, 2012 at 4:15 pm

Good, I was about to post asking if anyone had written to the authors prior to jumping on them and executing them.

There could be a valid reason, though I imagine from the blog post, no reason would be entirely acceptable.

Sheri says:

February 3, 2012 at 4:25 pm

To follow up on Julie's comment, the sample containing MG-II had 14.5 Gbp of filtered reads, so 7.5% of that gives ~1 Gb of sequence corresponding to MG-II. At a genome size of 2 Mbp this gives ~500x coverage. Baker et al. assembled genomes of the tiny euryarc ARMAN-2 from 100 Mbp of Sanger metagenomic data (15x coverage). I'd be interested how well the new assembly method works for genome reconstruction with less than 500x coverage.
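The coverage arithmetic in this comment can be checked with a short sketch. The input figures (14.5 Gbp of filtered reads, a 7.5% share, a 2 Mbp genome) are the ones quoted in the comment; the function and variable names are illustrative assumptions, and the calculation is only a back-of-the-envelope estimate, not anything taken from the paper's methods.

```python
def fold_coverage(total_bp: float, fraction: float, genome_size_bp: float) -> float:
    """Rough fold coverage: bases attributed to an organism / its genome size."""
    return total_bp * fraction / genome_size_bp


# Figures quoted in the comment above.
total_filtered_bp = 14.5e9   # 14.5 Gbp of filtered reads
mgii_fraction = 0.075        # 7.5% of the sample
assumed_genome_bp = 2e6      # assumed 2 Mbp genome size

# Bases attributable to MG-II, then the implied fold coverage.
mgii_bp = total_filtered_bp * mgii_fraction
coverage = fold_coverage(total_filtered_bp, mgii_fraction, assumed_genome_bp)

print(f"MG-II sequence: {mgii_bp / 1e9:.2f} Gbp")
print(f"Approximate coverage: {coverage:.0f}x")
```

Under these inputs the estimate lands a little above 1 Gbp and a little above 500x, consistent with the rounded figures in the comment.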

Jonathan Eisen says:

February 3, 2012 at 5:30 pm

Note – Virginia Gewin did contact me about commenting on the paper but we did not connect so I did not talk to her about her story. I was contacted by Biotechniques and they wrote an article.

Jonathan Eisen says:

February 3, 2012 at 5:33 pm

Here are my full answers to the reporter's questions about the paper:

1. What has made it challenging for scientists to study microorganisms such as marine archaea?

There are multiple challenges to studying microbes such as marine
archaea. These include:

1) They are small. This may seem to be an obvious issue, but their
small size makes it difficult to study the activities of microbes in
the field. Whereas one can observe multicellular organisms (e.g.,
plants, animals, macrofungi, kelp, etc) directly in the field and
record certain actions (e.g., what mammals eat), studying the activity
of microbes in the field is more challenging.

2) Even when one can do field observations, one major problem is that
the appearance of a microbe is not a good indicator of what kind of
organism it is. Thus you might see something but not know what it
was.

3) One way around these issues is to grow microbes in the lab –
something known as culturing. Culturing allows one to determine many
of the characteristics of individual kinds of microbes.

4) Unfortunately, many microbes cannot be grown in the lab.

2. How does this new technique for isolating individual genomes improve on past methods?

In general, analysis of the DNA (as well as RNA and proteins) of
microbes in the field allows one to learn a great deal about microbes
in a sample without growing them in the lab. Genome sequencing of
microbes in the field (something generally known as metagenomics)
allows one to make many inferences and predictions about the biology
and evolution of microbes. Unfortunately, piecing together entire
genomes of microbes in the field can be hard.

I note – it is hard to tell from the main text of the paper just what,
if anything, is new here in terms of techniques. There is
a lot of supplemental information I do not have access to.

I note – we have assembled complete / nearly complete genomes via
metagenomics previously – as have others. In most cases this prior
work has been done in very low diversity samples (e.g.,
endosymbionts). This new work appears to be somewhat unique in
assembling nearly complete genomes from complex communities.

3. What do you think the implications of this technique will be?

Hard to know – this depends on whether and how they make the methods
they used broadly available. Is the software going to be available
for all to use? Did they have to use specialized lab methods and if
so, will anyone be able to make them work or are they complex?

4. If scientists can now begin to study the genomes of microorganisms like marine archaea, how could that help with our overall understanding of our environment?

Microbes run the planet. We desperately need to better understand
their contributions to all ecosystems and to the functions occurring
therein. What we need is a field guide for microbes – akin to what we
have for birds – that would give us a picture of the current and past
details of microbial life on the planet. Only then will we be able to
have a full understanding of the planet as well as make predictions
about the future …

-DG says:

February 3, 2012 at 9:16 pm

There is always the possibility that the authors will make the “as is, used in the publication” code available to anyone on request, and that what is being released within the next 6 months is the user-friendly, fully usable program. In my experience it is fairly normal to be ready to publish before you have a nice user-friendly implementation of your software ready for release.

But of course the code/scripts you used in the publication need to be available right at the time of publication, even if they would, at that point, be less than useful for most researchers. It at least allows inspection for bugs and verification of results.

Shaun says:

February 4, 2012 at 1:00 am

This is truly annoying. In my opinion, the reviewers aren't doing their jobs if they haven't run the software in a paper like this (even if just on demo data, and yes, all computational biologists should provide an executable demo with their software).

I can point you to a Nature paper from a few years ago where software that was crucial to the findings was just described as “manuscript in preparation”. Guess what — the manuscript never appeared!

caseybergman says:

February 4, 2012 at 8:13 am

While I agree with Nick & Shaun that reviewers should help in the policing especially when journal guidelines are lax/ambiguous, in this case the authors (and editorial staff) are not even abiding by Science's own guidelines set out by Hanson, Sugden, Alberts in their editorial “Making Data Maximally Available”: http://www.sciencemag.org/content/331/6018/649.full

“To address the growing complexity of data and analyses, Science is extending our data access requirement listed above to include computer codes involved in the creation or analysis of data.”

Neil says:

February 4, 2012 at 12:13 pm

I blogged about “missing software” in papers recently. It drives me nuts. I agree that this constitutes improper reviewing and editorial practice. It's bad for science.

Ross Mounce says:

February 4, 2012 at 12:40 pm

This paper clearly doesn't abide by the Science Code Manifesto: http://sciencecodemanifesto.org/

I suggest everyone reads and endorses those sound principles (if they haven't already!).

Much like the Panton Principles (http://pantonprinciples.org/) they're a simple, clear set of guidelines on the use of software in academic publications.

Mike Taylor says:

February 4, 2012 at 3:14 pm

You think that's bad? What about Stevens, Kent A., and J. Michael Parrish. 1999. Neck Posture and Feeding Habits of Two Jurassic Sauropod Dinosaurs. Science 284:798-800 — http://www.sciencemag.org/cgi/reprint/284/5415/798.pdf

That came out thirteen years ago, and describes a then-new program for manipulating virtual 3D models of bones, in particular dinosaur neck bones. The software has never been released, so no one has ever been able to even attempt replicating their results. (Disclosure: I think their results are flawed, and have published on the subject.)

Paul says:

February 7, 2012 at 3:23 am

This paper is mentioned in the NY Times today... alas, no mention of the software not being made available.

http://www.nytimes.com/2012/02/07/science/euryarchaeota-has-never-been-seen-but-now-its-genome-has.html?ref=science

Paul
http://www.ipscell.com

Titus Brown says:

February 13, 2012 at 3:36 am

I have since had the opportunity to read through the paper more thoroughly, before talking to a journalist about it. My initial impression was not quite right: they address the *scaffolding* part of assembly, in which contigs resulting from an initial round of assembly are connected together into longer ordered-and-oriented scaffolds that (in their case) appear to recover the majority of a genome. If this bears out on other data sets, this will be a very important contribution to metagenomics; previous efforts used hand-curation of contigs that looked similar based on various metrics, and something automated would be a significant advance in the field. However, they do rely on the (rather poor quality) initial assemblies coming out of Velvet prior to doing their scaffolding. I'll blog more about this after we submit our next paper addressing some of these issues. (I just don't have the time right now.)

On the flip side, if their scaffolding approach doesn't bear out on other samples, then I will be very sad…

Jonathan Eisen says:

February 13, 2012 at 8:38 am

Thanks for the update. Still hard to test their method if, well, they don't make the method available.

pyrimidine says:

September 20, 2012 at 1:43 pm

This article is worth a follow-up. We're now edging into fall and the website was supposed to have released all code by now. They have released only the first part/phase of three (and the initial code release on github hasn't seen activity since the creation of the repo), which is yet another example of why we need more adherence to standards like the Science Code Manifesto and the practices outlined for the Bioinformatics Testing Consortium.

Felipe Leprevost says:

September 21, 2012 at 1:36 am

Sometimes it seems that people from the biological sciences still don't see software and code as true scientific results or as part of the scientific method. Wet-lab results are mandatory and are expected to meet some level of quality, while results based on computational methods can be messy, poorly described and lacking controls. A paper describing results based on new software, without the software itself, doesn't make sense; we can only take on faith what is said.

Bing Ma says:

September 28, 2012 at 7:42 pm

Interesting how they took your words out of context in their “objective report”
