Please – bash my latest paper – for the benefit of humanity

My lab has a new paper that just came out on the sequencing and analysis of the genome of a pretty cool (or hot actually) bacterium, Thermomicrobium roseum, which was isolated from Toadstool Spring, an alkaline siliceous hot spring in Yellowstone National Park. This paper is from a grant we had when I was at TIGR as part of the “Assembling the Tree of Life” program at NSF. Our grant was focused on generating genome sequences from phyla of bacteria for which no genomes were available.

At the time this species was a representative of a phylum that had no genomes. After we started sequencing, the phylum was dissolved, but never mind that for now. We report what I think are some very interesting things in the paper. Among them:
  • We report the first example of a plasmid that encodes all the genes needed for chemotaxis including all the genes for making a flagellum. Given that they are on a plasmid, this suggests that motility could be easily transferred between species.
  • We report experimental work and genome analysis that helps understand the novel membrane and cell wall structure in this species.
  • This is the first thermophile known to oxidize carbon monoxide.
But I am not writing per se about the things I like about our paper. I am instead asking people out there to find things wrong with our paper. Why am I doing this? Because this paper is part of a broader experiment in publishing in that it is in PLoS One. And one of the main benefits of PLoS One is the feature that allows commenting on publications. I personally believe such features are part of the future of scientific publication. But it is currently unclear just how effectively such commenting features are used (note Euan Addie is doing a survey about comments on PLoS One papers here).

So I am offering up my paper as a case study. If you comment and ask questions or make critiques, I will try to respond. And if you think something in our paper is wrong or weird, please say so. If you think something in our paper is supported by other work we do not cite, please say this too. If you have anything useful to say, please make comments.

How do you do this?

  • Go to the paper at the PLoS One Web Site.
  • In the upper right click on “Login” if you have an account or “Create account” if you do not.
  • Return to the paper once you are logged in.
  • Find some part of the text you want to comment on.
  • Highlight that text and, over on the right, click “Add a note” or “Make a comment”.
  • Fire away.

Harold Varmus on Science Friday

There was a very interesting interview on Science Friday last week.  The discussion was with Harold Varmus (see Science Friday Archives: Harold Varmus).
In the interview, Varmus discussed his new book, his role as an advisor to Obama, and some issues relating to Open Access. I found his comments to be very interesting and insightful, and the interview is worth listening to.

Pictures from Yolo Basin – Bitterns, Night Herons, Owl

Yolo Basin

Nice little PLoS reference by Nicholas Kristof in the Times …

Just a quick one here. In an article in today’s Times, Nicholas Kristof writes about “Putting Torture Behind Us” and he has a little PLoS reference there …

“Granted, returning the base to Cuba may not be politically realistic. So here’s a fallback alternative: turn the base into a research center for tropical diseases.

This was proposed in a medical journal, PLoS Neglected Tropical Diseases, a year ago, and it makes more sense now than ever

In Latin America and the Caribbean, there are still more than half-a-million cases annually of dengue fever (which causes excruciating pain and sometimes death), nearly 50,000 new cases of leprosy and more than 700,000 cases of elephantiasis (which causes grotesque deformities). In addition, 50 million Latin Americans have hookworms inside them, often causing anemia and making it more difficult for children to concentrate in school.

Peter Hotez, the president of the Sabin Vaccine Institute at George Washington University and the editor of PLoS Neglected Tropical Diseases, says that an international center on Guantánamo could become a symbol of United States cooperation in the region.

Imagine if people around the world came to think of Guantánamo as a place where America led a battle against hookworms and leprosy. That would help us fight terrorism far more effectively than the prison at Guantánamo ever did.”

Hat tip to Chris Schelleng for pointing this out. The original PLoS NTD article by Peter Hotez is here.

Benefits of Open Access: enabling musical interpretations of human genomics …

Now this is one of the most creative uses of open access science publications I have seen in a while. The video is from a paper by Dan Falush and colleagues that was in PLoS Genetics. Listen/see how the music changes with the genetics/migration of humans.

So I guess given some of my recent posts, we must ask what should we call this? Musicomics (which has a following but most of the use of the term seems to refer to music and comics together, although I did find one site with a reference that is about genomics) or genomusic (most of which seems to refer to people named Geno making music). Maybe, maybe, we just should say it is “nameless” but nice.

Anyway — a nice use of open access — the material from the PLoS Genetics paper is under a broad Creative Commons license and thus this type of use is allowed (and the source is attributed in the YouTube notes). Not sure about the exact details of the origins of the music for the video, but Dan Falush has hinted to me that it was some spontaneous contribution by a band in LA.

Obama’s Science Team Big on Evolution

Much has been written and will be written about how Obama is taking science seriously.  To me, one great sign of this is that not only is evolution OK to talk about now, but – gasp – many of his science team actually have worked on evolution.   For example:

  • Eric Lander, part of Obama’s council of advisors on science and technology, has written many papers either directly or indirectly about evolution. 
  • Harold Varmus, also on this Council, has written about evolution of viruses (e.g. here).
  • Jane Lubchenco is an ecologist who in much of her work has an evolutionary ecology angle.
Even John Holdren, who is more of a physicist and as far as I can tell has not written explicitly about evolution recently, certainly discussed it in some of his earlier publications with Paul Ehrlich.
So, not only are science in general and life science in particular on the upswing, but evolution is too. Maybe this is why Darwin endorsed Obama so many months back.

Calling all phylogeneticists – we need your help with metagenomic data

I have decided to post a question here to my blog requesting help from phylogeneticists everywhere in doing phylogenetic analysis of data from metagenomic projects. Here I will try to describe the problem and then hopefully people out there can chime in on what they think we/others should do to handle this type of data.

So here is the deal. We would like to perform a variety of phylogenetic analyses of data from “environmental shotgun sequencing (ESS)” projects in which one isolates DNA from an environmental sample (e.g., soil, water) and then randomly sequences fragments of that DNA. ESS is in essence a subset of “metagenomics” which is basically the study of the genomes of organisms from environmental samples. (I wrote a brief piece on ESS in PLoS Biology last year which can be found here).

Though there are lots of things we would like to do with phylogenetic analysis of this type of data, I am going to focus here on one specific thing. We would like to take sequence reads that contain matches to a specific gene/gene family (e.g., RecA, my favorite gene), build a multiple sequence alignment that includes these reads as well as all members of this gene family from known organisms, and then build phylogenetic trees from these alignments. (And by we here I mean like totally lots of people, including in particular a Gordon and Betty Moore Foundation funded project called iSEEM I am working on with the labs of Katie Pollard and Jessica Green.)

The challenge with this is really two things. First, we want to analyze just the reads themselves (i.e., we do not want to use assemblies you can make from this type of data). Second, and more importantly, we want to include in our analysis sequence reads that only cover small, not necessarily overlapping regions of the “full length” sequence alignments for the family.

The alignment would look something like

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX-
    where Xs are the regions covered by the sequences/fragments (could be DNA or amino acids)

We want to build trees from these alignments with the hope of using them to learn lots of cool things about the evolution of the fragments and the species from which they come. I can provide more information but really the key part for the phylogenetics here is the nature of the alignment.

In the past, I have decided to constrain my analyses to NOT deal with this type of alignment. I have either analyzed each fragment on its own or we have built a multiple alignment but only included fragments that cover more than 3/4 of the full length sequence and thus the matrix is much more filled out. Such an alignment would look like this

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX——-
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 --XXXXXXXXXXXXXXXXXXXXXXXX--------
    fragment 3 -----XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 –XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX- 
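
For the curious, the 3/4 cutoff can be sketched in a few lines of Python. This is only an illustration, not the code we actually used: the function names (`coverage`, `filter_rows`), the toy alignment, and the choice to treat any non-gap character as "covered" are all assumptions made for the example.

```python
def coverage(row, gap_chars="-."):
    """Fraction of alignment columns in which this row has a residue."""
    if not row:
        return 0.0
    covered = sum(1 for ch in row if ch not in gap_chars)
    return covered / len(row)


def filter_rows(alignment, min_coverage=0.75):
    """Keep only rows that cover more than min_coverage of the columns."""
    return {name: row for name, row in alignment.items()
            if coverage(row) > min_coverage}


# Toy alignment, 20 columns wide (hypothetical data).
alignment = {
    "sequence 1": "X" * 20,
    "fragment 1": "X" * 5 + "-" * 15,            # covers 25% of the columns
    "fragment 2": "-" * 2 + "X" * 16 + "-" * 2,  # covers 80% of the columns
}

kept = filter_rows(alignment)
print(sorted(kept))
```

Here "fragment 1" is dropped and "fragment 2" is kept, which is exactly the kind of filtering that throws away the short reads we would like to hold on to.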

But we really want to include the smaller fragments in our analysis. And we are just not certain how to best do this. We know LOTS of people out there think of similar problems in terms of sparse matrices, supermatrices, supertrees, EST data, etc. And we have ideas about how to do this and are asking some phylogenetics gurus we know by email. But I thought it might be fun to have the discussion on a blog rather than by email.

So again, how might one best build phylogenetic trees from data that looks like this?

    sequence 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 1 XXXXXXXXX-------------------------
    sequence 2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 2 ---------XXXXXXXXXXXX-------------
    fragment 3 ---------------------XXXXXXXXXXXXX
    fragment 4 ----XXXXXXXXXXXXXXXXXX------------
    sequence 3 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    sequence 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    fragment 5 -----------------------XXXXXXXXXXX-

And from these trees we want to place each fragment relative to (1) the full length sequences and (2) each other, if possible. We also, of course, want branch lengths to reflect the amount of evolution and thus do not just want a cladogram.
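
To make the difficulty concrete, here is one naive starting point, offered only as a sketch and not as an answer to the question: compare each pair of rows using just the columns where both have data. The Python below computes an uncorrected p-distance that way and returns `None` when two fragments share no columns at all. The function name, the gap characters, and the toy DNA sequences are assumptions for illustration.

```python
def p_distance(row_a, row_b, gap_chars="-."):
    """Proportion of differing sites over columns where both rows have data.

    Returns None if the two rows share no columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    # Keep only the column pairs in which neither row has a gap.
    shared = [(a, b) for a, b in zip(row_a, row_b)
              if a not in gap_chars and b not in gap_chars]
    if not shared:
        return None
    mismatches = sum(1 for a, b in shared if a != b)
    return mismatches / len(shared)


# Toy DNA alignment, 12 columns wide (hypothetical data).
full = "ACGTACGTACGT"
frag_a = "ACGAAC------"  # covers the first six columns
frag_b = "------GTACTT"  # covers the last six columns
frag_c = "---TACGA----"  # covers five columns in the middle

print(p_distance(full, frag_a))    # one mismatch over six shared columns
print(p_distance(frag_a, frag_b))  # no shared columns at all
```

Fragments can always be compared to the full length sequences this way, but pairs like `frag_a` and `frag_b` leave holes in any distance matrix, which is part of why placing fragments relative to each other is the hard bit.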

Any suggestions would be appreciated. Fire away with questions too …

Computational Biologists bring together some of the "best jobs" in the US

Check out the WSJ article on “The Best and Worst Jobs in the U.S. – WSJ.com”

The top 5 are
1. Mathematician
2. Actuary
3. Statistician
4. Biologist
5. Software Engineer

Seems like I know a fair # of people who combine #1, #3, #4 and #5 although rarely in equal amounts. I am not sure if anyone out there combines all of the top 5 but there must be some scientist/actuaries doing computational biology, right? Seems like a pretty strange list to me in some ways, but I must say, being a mathematically inclined computational biologist is pretty fun. Now if I only knew statistics …

Hat tip to Lior Pachter for posting this to Facebook where I found it.

Acid Rock Bacteria Genome …

Just a little plug for a new paper of which I am a co-author. This is a report on the analysis of the genome sequence of Acidithiobacillus ferrooxidans which was just published in BMC Genomics (an open access journal, by the way). This paper was a long long time coming – the genome was sequenced when I was at TIGR many years ago (Herve Tettelin coordinated most of the work). Since I was interested in the biology of this bug I volunteered to help turn the sequence into a paper, but was pretty lame about doing that. Thankfully David Holmes and Jorge Valdes in Chile helped make a paper from the data and many additional analyses. Here is the abstract:

Background
Acidithiobacillus ferrooxidans is a major participant in consortia of microorganisms used for the industrial recovery of copper (bioleaching or biomining). It is a chemolithoautotrophic, γ-proteobacterium using energy from the oxidation of iron- and sulfur-containing minerals for growth. It thrives at extremely low pH (pH 1–2) and fixes both carbon and nitrogen from the atmosphere. It solubilizes copper and other metals from rocks and plays an important role in nutrient and metal biogeochemical cycling in acid environments. The lack of a well-developed system for genetic manipulation has prevented thorough exploration of its physiology. Also, confusion has been caused by prior metabolic models constructed based upon the examination of multiple, and sometimes distantly related, strains of the microorganism.

Results
The genome of the type strain A. ferrooxidans ATCC 23270 was sequenced and annotated to identify general features and provide a framework for in silico metabolic reconstruction. Earlier models of iron and sulfur oxidation, biofilm formation, quorum sensing, inorganic ion uptake, and amino acid metabolism are confirmed and extended. Initial models are presented for central carbon metabolism, anaerobic metabolism (including sulfur reduction, hydrogen metabolism and nitrogen fixation), stress responses, DNA repair, and metal and toxic compound fluxes.

Conclusion
Bioinformatics analysis provides a valuable platform for gene discovery and functional prediction that helps explain the activity of A. ferrooxidans in industrial bioleaching and its role as a primary producer in acidic environments. An analysis of the genome of the type strain provides a coherent view of its gene content and metabolic potential.

Stan Falkow, only 74 and getting more famous by the day

Nice article in USA Today about Stan Falkow focusing in part on his Lasker Award. Good to see him continue to get some props as he, well, rocks. Note – I wrote about him getting a Lasker Award four months ago here but maybe I was too early?