Draft blog post cleanup #1: Divide and Conquer to Find Orthologs

OK – I am cleaning out my draft blog post list.  I start many posts and don’t finish them, and then they sit in the draft section of Blogger.  Well, I am going to try to clean some of that up by writing some mini posts.  Here is the first:

Saw an interesting paper worth checking out:
PLoS ONE: Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

It not only describes a way to speed up ongoing ortholog annotation in bacterial and archaeal genomes but is also linked to an ongoing open code development project.

Here is the abstract:

Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. 
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/.
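
To get a feel for why working with pan-genomes cuts down the number of alignments, here is a toy Python sketch. It is not xBASE-Orth's code or its actual algorithm: the species, the gene labels, and the helper functions are all made up for illustration. A "pan-genome" here is simply the set of distinct labels seen in a species, which assumes the within-species step has already grouped equivalent genes under one label.

```python
from itertools import combinations

def pairwise_alignments(collections):
    """Count sequence-vs-sequence alignments needed to compare every
    collection of sequences against every other collection."""
    return sum(len(a) * len(b) for a, b in combinations(collections, 2))

def pan_genome(genomes):
    """Toy pan-genome: one representative per distinct gene label."""
    pan = set()
    for genome in genomes:
        pan |= genome
    return pan

# Hypothetical species, each a list of genomes (sets of gene labels).
species_a = [{"a1", "a2", "a3"}, {"a1", "a2", "a4"}]
species_b = [{"b1", "b2"}, {"b1", "b3"}, {"b1", "b2", "b3"}]

# Flat approach: every genome against every other genome.
flat = pairwise_alignments(species_a + species_b)

# Divide and conquer: compare genomes within each species first,
# then compare only the species-level pan-genomes with each other.
within = pairwise_alignments(species_a) + pairwise_alignments(species_b)
between = pairwise_alignments([pan_genome(species_a), pan_genome(species_b)])
hierarchical = within + between

print(flat, hierarchical)
```

Even in this tiny example the hierarchical count comes out lower than the flat one, because the redundant genes shared within a species are only carried up the tree once.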

Definitely worth checking out.

Playing around with CloVR – cloud computing bioinformatics system

There is a nice new tool/resource out there for metagenomic and genomic analysis called CloVR. It is described in the paper CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing
It is available at http://clovr.org and should be useful to the many people doing genomics and metagenomics who want to make use of cloud computing resources.

CloVR is brought to us by Florian Fricke, Owen White, Sam Angiuoli, and others from the University of Maryland (full disclosure – many of the authors are ex-colleagues of mine from TIGR).

Not only is CloVR available openly and freely, but they even have a CloVR blog: http://clovr.org/category/blog/ … though it does not seem to be heavily used.  Kudos to this team for producing and releasing this software for others to use.  And kudos to NSF, USDA and NIH for funding its development. I have a feeling many people will use it.

New paper from my lab (& the Facciotti lab): Mauve Assembly Metrics #Halophiles #Genomics

Just a quick post here. A new paper from my lab has come out in Bioinformatics. The paper is relatively simple. Titled “Mauve Assembly Metrics”, it reports work of Aaron Darling and Andrew Tritt (with some minor contributions from me and Marc Facciotti). Aaron wrote the program Mauve when he was a student in Nicole Perna’s lab at Wisconsin: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Over the years he and others have continued to develop the program and have written a few more papers, including, for example, progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. This new paper basically reports a system (a set of scripts) to measure assembly quality. Here is the abstract:

High throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be manually set by the user. One approach to parameter tuning involves assembling data from an organism with an available high quality reference genome, and measuring assembly accuracy using some metrics. We developed a system to measure assembly quality under several scoring metrics, and to compare assembly quality across a variety of assemblers, sequence data types, and parameter choices. When used in conjunction with training data such as a high quality reference genome and sequence reads from the same organism, our program can be used to manually identify an optimal sequencing and assembly strategy for de novo sequencing of related organisms.

Check out the paper: Mauve Assembly Metrics. Download the scripts/code from http://ngopt.googlecode.com, along with Mauve, play around, and let me know what you think.
Note that this paper was supported by a grant from the National Science Foundation (ER 0949453). That grant is focused on comparative genomics (sequencing and analysis) of halophilic archaea. Stay tuned for more on that project as we are writing up a series of papers ….

Yes, I am a #RedSox & #PLoS fan; & this video sort of is proof #BenFranklinAward #OpenScience

Just saw this posted on YouTube.  Did not know it was coming … but am happy they recorded it.

And here are the slides I used.  Will try to synch

For more on this award see

Boston, Bioinformatics & Ben Franklin Award wrap up from #BioIT11

Photo by Mark Gabrenya

Well, just got back from Boston where I went to the BioIT World convention to pick up the “Benjamin Franklin Award” for contributions to Open Science from Bioinformatics.Org.  A quick round up of the trip:

Flew to Boston early Tuesday AM.  Only thing of note – during a layover in Chicago I saw a bookstore selling autographed copies of “The Immortal Life of Henrietta Lacks” by Rebecca Skloot.

Dropped off my stuff at the Seaport Hotel – had a nice view from my room.

Called up my friend Ashlee Earl, who currently works at the Broad, and arranged to meet her at Kenmore Square in an hour.  I collaborated with Ashlee many years ago on analyzing her expression studies of the Deinococcus radiodurans genome and we have been friends ever since. Took the T to Kenmore Square and met Ashlee and then went into this “Fenway Park” place to see a baseball game.  (I was born in Boston and am a RedSox fan …) Had the best baseball seats ever – front row Green Monster seats – which I had bought from StubHub.com.  Watched the Sox lose while Ashlee and I discussed Genome Centers.  I note – to those in the Broad Public Affairs office, Ashlee makes the Broad sound like a great place to work.  I tried to get some dirt out of her but she did not provide much.

Photo by Ashlee Earl

Photo by Ashlee Earl

Took the T back to the hotel after the game and went to sleep.  Got up very early to think about my “acceptance speech” for when I was to pick up my Ben Franklin Award.  I made some quick slides on my iPad (this was the first time I have gone to a meeting w/o my laptop) and during the talk before the award ceremony I emailed them to one of the organizers and we got things set up.

Then I was introduced and Jeff Bizzaro read a mini statement about why I won the award.  Something like what they put on the Bioinformatics.org web site:

Jonathan uses his high visibility in social media to advocate for open access by sharing links to discussions, mentioning open access articles and initiatives, and pushing for the opening up of popular closed access articles. This culture is shared with his students, who advocate for “open access” peer reviewing and created a peer-to-peer service for sharing bioinformatics material (articles, software and datasets). He is the academic editor in chief of PLoS Biology [1] and voices his opinions and support for open access publication and open data sharing on his “Tree of Life” blog [2]. In addition to just voicing his opinion, he also practices what he preaches, by refusing to publish in non-open access journals. With respect to bioinformatics, he has been involved with many software packages that are freely available, such as the recent AMPHORA [3] and PhyloOTU [4]. Lastly, Jonathan helped release a new open data sharing tool for scientists called BioTorrents [5]. This is just another step in encouraging all scientists to share their data and results more openly.


1. http://www.plosbiology.org/ 

2. http://phylogenomics.blogspot.com/ 

3. http://genomebiology.com/2008/9/10/R151 

4. http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001061 

5. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010071

Note – am proud to get this award.  It is given for contributions to Open Science and previous winners are an esteemed crew: Michael Eisen (my brother), Alex Bateman, Michael Ashburner, Jim Kent, Robert Gentleman,  Phil Bourne, Lincoln Stein, Ewan Birney, and Sean Eddy (see full details here).

Then I gave my mini talk focusing on a brief history of how I got into Open Science.  Here are my slides

Note the awkward typo where I introduced the “Public Library of Science”. Oops.  Anyway – talked for a few minutes.  While wearing my RedSox PLoS 1 shirt I note.

Photo by Jeff Bizzaro
Photo by Mark Gabrenya

Photo by Mark Gabrenya

And then there was a break. They took some pictures during the break and eventually I wandered around to the booths.

Photo by Mark Gabrenya

Photo by Mark Gabrenya

Photo by Mark Gabrenya

I saw Nat Pearson, who now works for Knome, and I went to lunch with him to discuss my “Exome”, which Knome has sequenced.  I note, Nat was a student in a class I TA’d at Stanford; good to see how far he has come.

And then back to the meeting where I wandered around again for a while.  Saw an old friend from TIGR, Xiaoying Lin, who now works at Life Technologies, and discussed the Ion Torrent with him.

Was pleased to see a booth giving away free RedSox tickets as a prize.

Then I headed out to Brookline for dinner with my Aunt and Uncle and cousin and eventually made my way back to the hotel where I had a few drinks.

The next day I got up a bit late, and eventually made my way to Logan Airport where the trip home was a disaster.  My outbound flight was late.  Missed my connection.  Then the flight I was on was held up for others to make their connection.  Though I did get a few hours in Denver Airport to wander around.  Finally made it home after 1 AM …

Tracing the evolutionary history of Sarah Palin: links to a parasitic nematode and the pathogenic fungus Botryotinia fuckeliana

You see, as a total sequence analysis dork, when I see names, I frequently ask whether the name includes only letters that are used as amino acid abbreviations. I started this game when the brilliant notes/letters came out in Science in the early 90s about whether ELVIS was overrepresented in protein sequences. Of course, despite these letters being 20 years old, Science still keeps them under wraps, requiring registration to see them (see for example the Stevens letter).

Anyway, alas, three of the major candidates for the US election have names that include letters not used as standard amino acid abbreviations, so I am stuck with analyzing Sarah Palin. But that is OK because of her professed aversion to evolution and support for Creationism (and since sequence analysis is inherently an evolutionary study).
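
For anyone who wants to play the same game with other names, here is a minimal Python sketch of the check. The function names are my own invention, and I am assuming only the 20 standard one-letter amino acid codes count (so B, J, O, U, X and Z are out, even though some of them are used as ambiguity or rare-residue codes).

```python
# The 20 standard amino acids, by one-letter code.
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")

def non_amino_acid_letters(name):
    """Return the letters in `name` that are not standard amino acid codes."""
    letters = {c for c in name.upper() if c.isalpha()}
    return sorted(letters - AMINO_ACID_LETTERS)

def is_peptide_name(name):
    """True if every letter in `name` is a standard amino acid code."""
    return not non_amino_acid_letters(name)

# "Bob" is just a made-up counterexample.
for name in ["Sarah Palin", "ELVIS", "Bob"]:
    print(name, is_peptide_name(name), non_amino_acid_letters(name))
```

Spaces and punctuation are ignored, so "Sarah Palin" is treated as the pretend peptide SARAHPALIN.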

So – I took her name and went to the NCBI BLAST page and did some searches. And what came up? Well, here are some of the top hits from the blastp searches (which I used to compare the pretend peptide “SARAHPALIN” with all the peptides in the non-redundant collection at GenBank).

>ref|XP_001545292.1| Gene info hypothetical protein BC1G_16161 [Botryotinia fuckeliana B05.10]
gb|EDN25226.1| Gene info predicted protein [Botryotinia fuckeliana B05.10]

GENE ID: 5425746 BC1G_16161 | hypothetical protein
[Botryotinia fuckeliana B05.10]

Score = 26.9 bits (56), Expect = 189
Identities = 8/9 (88%), Positives = 8/9 (88%), Gaps = 0/9 (0%)

Query 1 SARAHPALI 9
Sbjct 209 SARAQPALI 217

>ref|YP_061725.1| Gene info homoserine dehydrogenase [Leifsonia xyli subsp. xyli str. CTCB07]
gb|AAT88620.1| Gene info homoserine dehydrogenase [Leifsonia xyli subsp. xyli str. CTCB07]

GENE ID: 2939000 thrA | homoserine dehydrogenase
[Leifsonia xyli subsp. xyli str. CTCB07] (10 or fewer PubMed links)

Score = 26.9 bits (56), Expect = 189
Identities = 8/9 (88%), Positives = 8/9 (88%), Gaps = 0/9 (0%)

Query 1 SARAHPALI 9
Sbjct 267 SARVHPALI 275

>ref|ZP_02031476.1| hypothetical protein PARMER_01474 [Parabacteroides merdae ATCC
gb|EDN87136.1| hypothetical protein PARMER_01474 [Parabacteroides merdae ATCC

Score = 26.1 bits (54), Expect = 340
Identities = 7/8 (87%), Positives = 8/8 (100%), Gaps = 0/8 (0%)

Query 3 RAHPALIN 10

Sbjct 170 RAHPALVN 177

>ref|XP_567332.1| Gene info hypothetical protein CNJ01520 [Cryptococcus neoformans var. neoformans
ref|XP_773201.1| Gene info hypothetical protein CNBJ1950 [Cryptococcus neoformans var. neoformans
gb|EAL18554.1| Gene info hypothetical protein CNBJ1950 [Cryptococcus neoformans var. neoformans
gb|AAW45815.1| Gene info hypothetical protein CNJ01520 [Cryptococcus neoformans var. neoformans

GENE ID: 3254188 CNJ01520 | hypothetical protein
[Cryptococcus neoformans var. neoformans JEC21] (10 or fewer PubMed links)

Score = 26.1 bits (54), Expect = 340
Identities = 8/9 (88%), Positives = 8/9 (88%), Gaps = 0/9 (0%)

Query 1 SARAHPALI 9
Sbjct 415 SARQHPALI 423

>ref|YP_001626035.1| Gene info citrate synthase [Renibacterium salmoninarum ATCC 33209]
gb|ABY24621.1| Gene info citrate synthase [Renibacterium salmoninarum ATCC 33209]

GENE ID: 5822379 RSal33209_2898 | citrate synthase
[Renibacterium salmoninarum ATCC 33209]

Score = 25.7 bits (53), Expect = 456
Identities = 9/11 (81%), Positives = 9/11 (81%), Gaps = 2/11 (18%)

Query 1 SARAHP--ALI 9
Sbjct 218 SARAHPYAALI 228

>ref|YP_001817256.1| Gene info integral membrane sensor hybrid histidine kinase [Opitutus terrae
gb|ACB73656.1| Gene info integral membrane sensor hybrid histidine kinase [Opitutus terrae

GENE ID: 6208547 Oter_0366 | integral membrane sensor hybrid histidine kinase
[Opitutus terrae PB90-1]

Score = 25.2 bits (52), Expect = 611
Identities = 7/7 (100%), Positives = 7/7 (100%), Gaps = 0/7 (0%)

Query 3 RAHPALI 9
Sbjct 256 RAHPALI 262

>ref|YP_001757871.1| Gene info putative anti-sigma regulatory factor, serine/threonine protein
kinase [Methylobacterium radiotolerans JCM 2831]
gb|ACB27188.1| Gene info putative anti-sigma regulatory factor, serine/threonine protein
kinase [Methylobacterium radiotolerans JCM 2831]

GENE ID: 6141303 Mrad2831_5232 | putative anti-sigma regulatory factor,
serine/threonine protein kinase [Methylobacterium radiotolerans JCM 2831]

Score = 25.2 bits (52), Expect = 611
Identities = 7/8 (87%), Positives = 8/8 (100%), Gaps = 0/8 (0%)

Query 2 ARAHPALI 9
Sbjct 299 ARAHPALV 306

>ref|ZP_01466013.1| hydrolase, TatD family [Stigmatella aurantiaca DW4/3-1]
gb|EAU63211.1| hydrolase, TatD family [Stigmatella aurantiaca DW4/3-1]

Score = 25.2 bits (52), Expect = 611
Identities = 7/7 (100%), Positives = 7/7 (100%), Gaps = 0/7 (0%)

Query 3 RAHPALI 9
Sbjct 79 RAHPALI 85

>ref|YP_001558323.1| Gene info glycosyl transferase group 1 [Clostridium phytofermentans ISDg]
gb|ABX41584.1| Gene info glycosyl transferase group 1 [Clostridium phytofermentans ISDg]

GENE ID: 5743305 Cphy_1206 | glycosyl transferase group 1
[Clostridium phytofermentans ISDg]

Score = 25.2 bits (52), Expect = 611
Identities = 8/10 (80%), Positives = 8/10 (80%), Gaps = 0/10 (0%)


Query 1 SARAHPALIN 10
Sbjct 113 SERAHPLLIN 122

There does not appear to be a perfect match in the NCBI NR protein database. But take a close look at the #1 scoring hit. That is right, it is from an organism called Botryotinia fuckeliana. No comment on the appropriateness of this name, but it does contain a term I will probably use a lot if she gets elected.
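
As an aside, the "Identities" figures in the hits above are easy to reproduce by hand. Here is a small Python sketch (the function names are mine, and this is not how BLAST itself is implemented) that counts identical positions in two already-aligned strings, using two of the Query/Sbjct pairs shown above. The percentages in the pasted output look truncated rather than rounded (8/9 is reported as 88%), so the sketch does the same.

```python
def identity(query, subject):
    """Count identical positions in two aligned strings of equal length.

    Gap characters ('-') never count as identities.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query)

def percent_truncated(matches, length):
    """Whole-number percentage, truncated to match the output above."""
    return 100 * matches // length

# Aligned pairs copied from the Renibacterium and Parabacteroides hits.
for query, subject in [("SARAHP--ALI", "SARAHPYAALI"),
                       ("RAHPALIN", "RAHPALVN")]:
    matches, length = identity(query, subject)
    print(f"Identities = {matches}/{length} "
          f"({percent_truncated(matches, length)}%)")
```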

Of course, anybody who has heard me blather on and on about evolution knows that I am always talking about how BLAST top hits are not a good measure of relatedness per se (see my NAR paper where I first talked about this in 1995). So – I decided to build a tree of Sarah Palin. I used the NCBI Distance Tree option, which is available from BLAST search results.

Since most likely you cannot see that in enough detail – here is a zoom in.

That one did not come through on the blog so well either, so I decided to output the tree in Newick format and then searched for a program that could draw a better figure on the web (we have tools in my lab to do this but I am trying to do this all on the web as an exercise). I found a web site that makes drawtree available, plugged in the Newick format tree, and it made a nicer one.
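
In case Newick format is unfamiliar: it is just the tree written out as nested parentheses, ending in a semicolon. Here is a tiny Python sketch that writes one from nested tuples. The tree and its labels (other than SarahPalin) are placeholders, not the actual tree from the BLAST results, and branch lengths are left out to keep it short.

```python
def to_newick(node):
    """Render a nested-tuple tree as Newick text (without the final ';').

    A leaf is a string; an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# Placeholder tree: two pairs of leaves joined at the root.
tree = (("SarahPalin", "protein_A"), ("protein_B", "protein_C"))
newick = to_newick(tree) + ";"
print(newick)  # ((SarahPalin,protein_A),(protein_B,protein_C));
```

A string like that is what gets pasted into a tree-drawing program.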

Though making trees from really short sequences is not ideal, in this tree, Sarah Palin is shown to be at the root of a branch including a protein from the parasitic nematode Brugia malayi. So if we take an evolutionary interpretation it seems that this causative agent of filariasis (well, a protein from this agent) is descended from SarahPalin. In other words, she seems to be ancestral to this parasite.

So in conclusion – by similarity – SarahPalin is closest to a plant pathogen with an unusual name. And by phylogeny SarahPalin is ancestral to a parasitic nematode. Sounds about right.

Connection between Video Games and Bioinformatics?

The Scientist Magazine has a nice piece on one of my favorite people in all of Science – Sean Eddy. In the article, they discuss how Sean is one of those bioinformatics folks who does not just hack together some code to do something but actually writes really good code for his programs. For those of you who do not know, Sean has made a whole collection of software tools for biologists (see his web site here). Perhaps the most widely used is HMMER, which is designed for making and using hidden Markov models. But there are some other good ones he has put out. My favorite is Forester, which was made by Christian Zmasek in his lab and is supposed to be available here, although the site is not working right now (NOTE – Christian has posted a new link for it in the comments). I like this because, well, it is software for “phylogenomic” analysis.

Anyway – it is a nice article about Sean, especially the parts talking about how his background in video games contributed to his success in bioinformatics. Back to something I said above, Sean is without a doubt one of my favorite people in science. There are many reasons for this but here are a few.

  • He is very open with ideas.

    Once, at a conference, I gave a talk on this bizarre new pattern we had found when we were comparing the genomes of E. coli and V. cholerae. We had found that when we did genome-level alignments of these species there was an X-like pattern (see our paper on this here). Anyway, in the talk I said something to the effect of “we have no friggin idea how these X-like alignments could be generated”. And Sean, I think in the question session, pointed out that in another paper of ours we had seen what appeared to be symmetric inversions occurring around the origin of replication, and that these could create the X-alignment. And lo and behold he was right. We got the paper, but in large part it was his push that got us looking at the inversions sooner than we would have.

  • He is very open with science.

    Most of Sean’s work is on the open side of science. Open Source software. Open Access publications. Open everything. And I should point out that it was a talk by Sean that catalyzed my conversion into an Open Science supporter. I was attending a meeting in Ft. Lauderdale to discuss data release policies for genome projects. This meeting led to the “Ft Lauderdale Agreement” on data release, by the way. At the meeting there were many genomics players like Eric Lander and Francis Collins who were pushing for data release policies that were not completely open: genome centers would release data, but there would be constraints on its use so that the genome centers would be the first to be able to publish genome-scale analyses of an organism’s genome sequence.

    At the time I was working at TIGR and I supported this notion of basically letting people search for a few genes of interest but preventing them from doing genome analyses. And then Sean got up and gave a talk and, well, blew my mind. I am sure I have notes somewhere from the meeting but basically what he said was – the genome projects’ whole point is to generate genome data for people to do genome-level analysis. So how on earth can we justify preventing exactly the type of analysis that the projects were designed to enable? He was not saying that we should not somehow protect the genome centers. What he was saying was that for the benefit of science, we need to find a way to allow people to do genome-level analyses immediately on the data. And he also said that the risks of releasing one’s data with no restrictions are much less than everyone claims. I think he convinced many people that genome centers needed to open up their data release policies a bit more. And he convinced me.

    And so I went home from that meeting and decided to release the data from as many of my genome projects as I could, with NO restrictions (e.g., this is what we did with Tetrahymena). And also, this newfound belief in openness helped pave the way for my conversion to being an Open Access publishing supporter.
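
Going back to the X-like alignments for a second: Sean's point about symmetric inversions is easy to check with a toy simulation. The Python sketch below is only an illustration under made-up assumptions (a 20-gene circular genome, the origin at index 0, ten random inversions), not the analysis from our paper. Each inversion flips a segment centered on the origin, so a gene can only ever sit at its original position or at the mirror-image position on the other side of the origin. Plot original position against new position and those two possibilities are the two arms of the X.

```python
import random

def symmetric_inversion(genome, k):
    """Invert the segment reaching k genes to either side of the origin.

    `genome` is a list read around a circular chromosome, with the
    origin of replication at index 0.
    """
    n = len(genome)
    positions = [i % n for i in range(-k, k + 1)]
    segment = [genome[p] for p in positions]
    flipped = list(genome)
    for p, gene in zip(positions, reversed(segment)):
        flipped[p] = gene
    return flipped

n = 20                      # toy genome size (hypothetical)
ancestor = list(range(n))   # gene i starts at position i
descendant = ancestor
rng = random.Random(0)
for _ in range(10):
    descendant = symmetric_inversion(descendant, rng.randint(1, n // 2 - 1))

# Dot plot coordinates: (position in ancestor, position in descendant).
dots = [(gene, pos) for pos, gene in enumerate(descendant)]
# Every dot lies on the diagonal or on the anti-diagonal, i.e. on an X.
on_x = all(pos == gene or pos == (-gene) % n for gene, pos in dots)
print(on_x)
```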

Anyway, glad to see Sean getting positive press. It is well deserved. Now off to play some video games.

Tony Hey visits U. C. Davis

Just got back from a dinner with Tony Hey, who was visiting UC Davis to give a talk and meet with various people. Hey is currently VP for technical computing at some place called Microsoft. Hey has done some pretty interesting things in his career but what I know him from is his time as the head of the “E-science” initiative in the UK. Before I blather on about this … check out Timo Hannay’s blog about Hey’s visit to Nature, which has a pseudo outline of the talk he gave there.

It is interesting to see Microsoft getting into collaborative science. I hope they stay serious about it because we need more “top down” types of efforts at big places like Microsoft. Whether Microsoft could make much money out of contributing to science I do not know, but if they put 1/1000 of the effort into this as they do into games and Office, science would almost certainly benefit. Many years ago when I was at TIGR, some Microsoft folks came to visit (when genome-stocks were going crazy) and expressed an interest in getting more involved in bioinformatics and genomics. Looks like that did not go anywhere. Maybe now is the time to try to get them doing this again?

I know Microsoft is viewed as Evil incarnate by many academics but hey (no pun intended), given the cool stuff being done by the Gates Foundation in various areas of science, maybe Microsoft will move a little more into science if only to support Gates Foundation efforts. Certainly, Tony Hey’s background suggests that they have the potential to do some interesting stuff.