genomes – Jonathan Eisen's Lab

Everything You Wanted to Know about the Lake Arrowhead Microbial Genomes meeting #LAMG14

The Lake Arrowhead Microbial Genomes meeting, which happens every other year, is starting tonight. I love this meeting. No bias here since I am now a co-organizer. But I really love this meeting. I am posting here some background information about the meeting for those interested. We will be live tweeting the meeting using the hashtag #LAMG14. This year’s program is here.

Posts of mine about previous meetings

March 02, 2014: Save the dates / preliminary program for Lake Arrowhead Microbial Genomes Meeting
July 11, 2012: Lake Arrowhead Microbial Genomes 2012 #Genomes #Microbes #Mountains #Lake #Fun #Wine #MustGo
September 22, 2012: Lake Arrowhead Microbial Genomes Meeting 2012 Speaker Gender Ratio #LAMG12
September 22, 2012: Storify for Lake Arrowhead Microbial Genomes #LAMG12 Meeting
October 28, 2010: The Story behind the Meeting: Lake Arrowhead Microbial Genomes 2010
September 14, 2008: It’s Miller Time – Lake Arrowhead Microbial Genomes Conference — about to begin
September 16, 2008: Lake arrowhead notes – UPDATED
September 29, 2006: Genomics Education highlighted at 14th Annual International Meeting on Microbial Genomics

Blog posts by others

December 23, 2012: Srijak Bhatnagar: Lessons from 2012: Lake Arrowhead Microbial Genomes
September 21, 2008: Morgan Langille Review of Arrowhead Conference

Programs and notes from past meetings

Meeting Web Sites

Microbial Genomics – ARROWHEAD 2012

I have uploaded slides from my previous presentations at the meeting

Jonathan Eisen @phylogenomics talk for #LAMG12 from Jonathan Eisen

Jonathan Eisen talk on “The Importance of History” at Lake Arrowhead Small Genomes Meeting 2010 from Jonathan Eisen

Jonathan Eisen talk on “Genomic Encyclopedia” at Lake Arrowhead Small Genomes Meeting 2008 from Jonathan Eisen

Jonathan Eisen talk on “Endosymbiont Genomics” at Lake Arrowhead Small Genomes Meeting 2006 from Jonathan Eisen

Jonathan Eisen talk on “Phylogenomics of Microbes” at Lake Arrowhead Small Genomes Meeting 2004 from Jonathan Eisen

Jonathan Eisen talk on “Phylogenomics of Microbes” at Lake Arrowhead Small Genomes Meeting 2002 from Jonathan Eisen

Jonathan Eisen talk on “Phylogenomics of DNA repair” at Lake Arrowhead Small Genomes Meeting 2000 from Jonathan Eisen

Question – anyone having issues w/ delays/difficulty in the process of getting genomes / metagenomes into Genbank?

DNA sequencing continues to go crazy in terms of lower cost, higher speed, and spread of technology. Alas, some aspects of doing a genome project are not necessarily keeping up. So I am posting here to ask a simple question about one of these steps. What do people out there think about the steps of getting genome / metagenome data into Genbank? Without wanting to bias answers too much – we are having some challenges in this area. Storify of Twitter responses below the fold.

//storify.com/phylogenomics/a-work-around-for-genbank-bottleneck.js[View the story “A work around for Genbank bottleneck” on Storify]

Nice timing: Our paper on the Darwin’s Finch genome is out today on Darwin’s birthday

Birthday party for Darwin in 2009

Well, I assume this was on purpose from the folks at Biomed Central but not sure. Our paper on the genome of one of Darwin’s Finches is out today in BMC Genomics: BMC Genomics | Abstract | Insights into the evolution of Darwin’s finches from comparative analysis of the Geospiza magnirostris genome sequence.

Abstract of the paper:

Background
A classical example of repeated speciation coupled with ecological diversification is the evolution of 14 closely related species of Darwin’s (Galápagos) finches (Thraupidae, Passeriformes). Their adaptive radiation in the Galápagos archipelago took place in the last 2–3 million years and some of the molecular mechanisms that led to their diversification are now being elucidated. Here we report evolutionary analyses of genome of the large ground finch, Geospiza magnirostris.
Results
13,291 protein-coding genes were predicted from a 991.0 Mb G. magnirostris genome assembly. We then defined gene orthology relationships and constructed whole genome alignments between the G. magnirostris and other vertebrate genomes. We estimate that 15% of genomic sequence is functionally constrained between G. magnirostris and zebra finch. Genic evolutionary rate comparisons indicate that similar selective pressures acted along the G. magnirostris and zebra finch lineages suggesting that historical effective population size values have been similar in both lineages. 21 otherwise highly conserved genes were identified that each show evidence for positive selection on amino acid changes in the Darwin’s finch lineage. Two of these genes (Igf2r and Pou1f1) have been implicated in beak morphology changes in Darwin’s finches. Five of 47 genes showing evidence of positive selection in early passerine evolution have cilia related functions, and may be examples of adaptively evolving reproductive proteins.
Conclusions
These results provide insights into past evolutionary processes that have shaped G. magnirostris genes and its genome, and provide the necessary foundation upon which to build population genomics resources that will shed light on more contemporaneous adaptive and non-adaptive processes that have contributed to the evolution of the Darwin’s finches.

Figure 1

There is a long long long story behind this paper. Too long for me to write up right now. I wrote up some of the story for a Figshare posting of the genome data last year.

“Darwin’s Finches” are a model system for the study of various aspects of evolution and development. In 2008 we commenced a project to sequence the genomes of some of these species – inspired by the (then) upcoming celebration of the 200th anniversary of the birth of Charles Darwin (which was in February 2009). The project started with a brief discussion at the AGBT meeting in 2008 and then via an email conversation between Jonathan Eisen and Jason Affourtit about the possibility of a collaboration involving the 454 company (which was looking for projects to highlight the power of its then relatively new 454 sequencing machines). After further discussions between Jonathan Eisen, his brother Michael Eisen (who separately had become interested in Darwin’s finches) and people from 454 it was decided that this was a potentially good project for a scientific and marketing collaboration.

In these conversations it was determined that the most likely limiting factor would be access to DNA from the finches. This was largely an issue due to the fact that the Galapagos Islands (where the finches reside) are a National Park in Ecuador and also a World Heritage site. Collection of samples there for any type of research is highly regulated. Thus, Jonathan Eisen made contact with Peter and Rosemary Grant – the most prominent researchers working on the finches – with whom Eisen had discussed sequencing the finch genomes in the early 2000s. In that previous conversation it was determined that the sequencing would be too expensive to carry out without a major fundraising effort. However, with the advent of “next generation” sequencing methods such as 454 the total costs of such a project would be much lower.

In the conversations with the Grants, the Grants offered to ask around to see if anyone had sufficient amounts of DNA (or access to samples), which would be needed for genome library construction. Subsequently they identified Arkhat Abzhanov from Harvard as someone who likely had samples as well as permission to do DNA-based work on them, from many of the finch species. Abzhanov offered to provide samples from three key species (large ground finch Geospiza magnirostris, large cactus finch G. conirostris and sharp-billed finch G. difficilis) and DNA was sent to Roche-454 for sequencing in July of 2008. In August, the first “test” sequence data was provided from Geospiza magnirostris. A plan was then made to generate additional data and Roche offered to do the sequencing at their center at a steep discount. Funds were raised by Jonathan Eisen, Greg Wray, Monica Riley, and others to pay for the sequencing and over the next year or so, three sequencing bursts were conducted at Roche-454.

That is a decent summary of the background. The details on the science are in the paper. What the background does not say is that the project languished for years as we did not have funds to support the actual analysis of the genomes and it was kind of out of my normal area of expertise. Along the way, I did a poor job of communicating with some of the initial parties in the project (e.g., I did a really bad job of communicating with Greg Wray – who had provided some of the funds – and I will forever be trying to make things up to him). Anyway, thankfully Arkhat eventually pulled together a group of people led by Chris Ponting to help analyze the genome and Chris led the way to the paper that is out today. Only four years after our original goal.

I have been a birder and an evolutionary biologist for many many many years. Thus this is kind of a cool project for me. When I was in the Galapagos in 2002 I dreamed of doing a project like this – and even started doodling Darwin’s finches all over the place – including on some of the styrofoam cups we sent down to the bottom of the ocean on the outside of the Alvin sub as part of a deep sea research cruise I went on. See below:


Some related posts:

From 2002

Me, in the Galapagos in 2002

Me in the Galapagos in 2002

Me, today, w/ Darwin’s finch art

Rhodopsins Rhodopsins everywhere …

Was browsing through this paper (largely due to my interest in sequencing genomes of novel organisms): Genome Biology | Abstract | Genome of Acanthamoeba castellanii highlights extensive lateral gene transfer and early evolution of tyrosine kinase signaling.

And I found they found something very interesting. “We identified two rhodopsins both with C-terminal histidine kinase and response regulator domains with homology to the sensory rhodopsins of the green algae that represent candidates for light sensors in Ac (Figure 3).” Seems they found some homologs of the proteorhodopsin / halorhodopsin family of proteins which I have been interested in for years. Check out Figure 3:

Every couple of months there is a new group of organisms that is found to have a member of this gene family. See for example: Sequencing of Seven Haloarchaeal Genomes Reveals Patterns of Genomic Flux and Genome sequence of the Antarctic rhodopsins-containing flavobacterium Gillisia limnaea for papers in which I was involved where Rhodopsins were part of the story. Also see the Venter et al. Sargasso paper: PDF. Anyway – just a quick post for those out there interested in rhodopsins and the like …

Welcome to the Microbial Earth Project

Map of type strains.

All interested in microbes and their genomes should check out The Microbial Earth Project. It “is an international effort to generate a comprehensive catalog from genome sequences of all the archaeal and bacterial type strains. The name of the project comes from the recognition that Earth is a predominantly a microbial planet, and by effect in order to understand life on our planet, we need to understand how microbial life works.”

There are some 10,000 described type strains of bacteria and archaea. Not really a lot given that there are probably millions upon millions of species of bacteria and archaea. But it is what we have available to us in terms of the formally described and accepted species for which there is an available cultured strain.

At this site you can do things like “Adopt a Type Strain” or view a cool “Map of the type strains“.

The Steering Committee for the project is

Jonathan Eisen (University of California, Davis)
George Garrity (Names for Life, USA)
Philip Hugenholtz (Australian Centre for Ecogenomics Research, Australia)
Hans-Peter Klenk (DSMZ, Germany)
Nikos C. Kyrpides (DOE-Joint Genome Institute, USA)
William B. Whitman (University of Georgia)
Tanja Woyke (DOE-Joint Genome Institute, USA)

Much of the real work is being done by Nikos Kyrpides, George Garrity, and others, though I am very pleased to be a member of the Steering Committee. One of my key jobs will be to get the word out early and often. Hence this post.

Storify for Lake Arrowhead Microbial Genomes #LAMG12 Meeting

Meeting went well. Here is a storification of it:

http://storify.com/phylogenomics/lake-arrowhead-microbial-genomes-2012-lamg12.js?template=slideshow[View the story “Lake Arrowhead Microbial Genomes 2012 #LAMG12” on Storify]

A blast from the past: Plasmodium, plastids, phylogeny, and reproducibility

A few days ago I got an email from a colleague who I had not seen in many years. It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.

His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum – the causative agent of most human malaria cases – for which he was the lead author. Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper. The paper is freely available at Nature – at least for now – every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.

Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper. I note – as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome. I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.

Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted. And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need. It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.

As I was communicating with this person, I started digging through my files and my brain trying to remember exactly what had been done for this paper more than 10 years ago. I remember Malcolm and others from the Plasmodium community organizing some “jamborees” looking at the annotation of the genome. At one of those jamborees I met with some of the folks from the Sanger Center (which was one of the big players in the P. falciparum genome sequencing) with Malcolm and – after some discussion I ended up doing three main things relating to the paper, which I describe below.

Thing 1: Conserved eukaryote genes

One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea. I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.

As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome. In the N&V I had written “Genome sequencing: Brouhaha over the other yeast” I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes. With the help of the Nature staff I had also made a figure which demonstrated (sort of) what they were trying to do in their analysis – which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time. As another aside – the S. pombe genome paper and my News and Views article are freely available …

Figure 1: The tree of life, with the branches labelled according to Wood et al.’s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000.

Anyway, I did a similar analysis to what was in the S. pombe genome paper and I found a reasonable number and helped write a section for the paper on this.

Comparative genome analysis with other eukaryotes for which the complete genome is available (excluding the parasite E. cuniculi) revealed that, in terms of overall genome content, P. falciparum is slightly more similar to Arabidopsis thaliana than to other taxa. Although this is consistent with phylogenetic studies (64), it could also be due to the presence in the P. falciparum nuclear genome of genes derived from plastids or from the nuclear genome of the secondary endosymbiont. Thus the apparent affinity of Plasmodium and Arabidopsis might not reflect the true phylogenetic history of the P. falciparum lineage. Comparative genomic analysis was also used to identify genes apparently duplicated in the P. falciparum lineage since it split from the lineages represented by the other completed genomes (Supplementary Table B).

There are 237 P. falciparum proteins with strong matches to proteins in all completed eukaryotic genomes but no matches to proteins, even at low stringency, in any complete prokaryotic proteome (Supplementary Table C). These proteins help to define the differences between eukaryotes and prokaryotes. Proteins in this list include those with roles in cytoskeleton construction and maintenance, chromatin packaging and modification, cell cycle regulation, intracellular signalling, transcription, translation, replication, and many proteins of unknown function. This list overlaps with, but is somewhat larger than, the list generated by an analysis of the S. pombe genome (65). The differences are probably due in part to the different stringencies used to identify the presence or absence of homologues in the two studies.

The list of genes is available as supplemental material on the Nature web site. Alas it is in MS Word format which is not the most useful thing. But more on that issue at the end of this post.
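For readers who want to try this kind of screen on their own data, here is a minimal sketch of the presence/absence filter described above. It is not the original TIGR code. The input format (tuples of query, subject genome, E-value) and all function and variable names are hypothetical, and the two cutoffs are the ones quoted from the paper's methods later in this post.

```python
from collections import defaultdict

EUK_CUTOFF = 1e-30   # stringent cutoff for "present in every eukaryote"
PROK_CUTOFF = 1e-15  # looser cutoff for "any match at all in a prokaryote"


def eukaryote_specific(hits, euk_genomes, prok_genomes):
    """Return queries with a strong hit in every eukaryotic genome and no
    hit, even a weaker one, in any prokaryotic genome.

    hits: iterable of (query_id, subject_genome, evalue) tuples.
    This tuple layout is an assumption made for illustration.
    """
    # query -> genome -> best (lowest) E-value seen for that genome
    best = defaultdict(dict)
    for query, genome, evalue in hits:
        previous = best[query].get(genome)
        if previous is None or evalue < previous:
            best[query][genome] = evalue

    missing = float("inf")  # stands in for "no hit to this genome"
    selected = []
    for query, per_genome in best.items():
        in_all_euks = all(
            per_genome.get(g, missing) <= EUK_CUTOFF for g in euk_genomes
        )
        in_any_prok = any(
            per_genome.get(g, missing) <= PROK_CUTOFF for g in prok_genomes
        )
        if in_all_euks and not in_any_prok:
            selected.append(query)
    return sorted(selected)
```

Using a stricter cutoff for presence than for absence makes the list conservative in both directions, which is one reason two studies with different stringencies can end up with different lists.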

Thing 2: Searching for lineage specific duplications

Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available). The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome. The list of genes we identified this way is also provided as a Word document in Supplemental materials.
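The quick and dirty test described above is easy to sketch. The version below is an illustration under stated assumptions rather than the code used at TIGR: hits arrive as hypothetical (query, subject, subject genome, E-value) tuples, the trivial self-hit is skipped, and the 1e-15 cutoff is the one quoted from the paper's methods.

```python
def lineage_specific_duplicates(hits, own_genome, cutoff=1e-15):
    """Flag genes whose best match within their own genome beats their
    best match in every other genome.

    hits: iterable of (query_id, subject_id, subject_genome, evalue).
    The tuple layout and the names here are assumptions for illustration.
    """
    best_self = {}   # query -> best E-value against its own genome
    best_other = {}  # query -> best E-value against any other genome
    for query, subject, genome, evalue in hits:
        if genome == own_genome:
            if subject == query:
                continue  # ignore the gene matching itself
            table = best_self
        else:
            table = best_other
        if query not in table or evalue < table[query]:
            table[query] = evalue

    no_hit = float("inf")
    return sorted(
        query
        for query, evalue in best_self.items()
        if evalue <= cutoff and evalue < best_other.get(query, no_hit)
    )
```

One caveat: very strong matches can all report an E-value of 0.0, so a real screen might prefer to compare bit scores. A best-hit comparison is also only a first pass, since best hits do not always reflect the closest evolutionary relative.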

Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum

The third thing I did for the paper was to search for organelle derived genes in the nuclear genome of Plasmodium. Specifically I was looking for genes derived from the mitochondrial genome and plastid genome. For those who do not know, Plasmodium is a member of the Apicomplexa – all organisms in this group have an unusual organelle called the Apicoplast. Though the exact nature of this organelle had been debated, its evolutionary origins were determined by none other than Malcolm Gardner many years earlier (Gardner et al. 1994). They had shown that this organelle was in fact derived from chloroplasts (which themselves are derived from cyanobacteria). I am ashamed to say that before hanging out with Malcolm and talking about Plasmodium I did not know this. This finding of a chloroplast in an evolutionary group of eukaryotes that are not particularly closely related to plants is one of the key pieces of evidence in the “secondary endosymbiosis” hypothesis which proposes that some eukaryotes have brought into themselves as an endosymbiont a single-celled photosynthetic alga which had a chloroplast.

Anyway – here we were – with the first full genome of a member of the Apicomplexans group. And we could use it to discover some new details on plastid evolution and secondary endosymbioses. So I adapted some methods I had used in analyzing the Arabidopsis genome (see Lin et al. 1999 and AGI 2000), and searched for plastid derived genes in the nuclear genome of Plasmodium. Why look in the nuclear genome for plastid genes? Or mitochondrial genes for that matter. Well, it turns out that genes that were once in the organelle genomes frequently move to the nuclear genome of their “host”. In fact, a lot of genes move. So – if you want to study the evolution of an organism’s organelles, it is sometimes more fruitful to look in the nuclear genome than in the actual organelle’s genome. OK – now back to the Plasmodium genome. What I was doing was trying to find genes in the nuclear genome that had once been in the plastid genome. How would you look for these?

To find mitochondrial-derived genes I did blast searches against the same database of genomes used to study the evolution of eukaryotes but for this I looked for genes in Plasmodium that had decent matches to genes in alpha proteobacteria. And for those I then built phylogenetic trees of each gene and its homologs, then screened through all the trees to look for any in which the gene from Plasmodium grouped in a tree inside a clade with sequences from alpha proteobacteria (and allowed for mitochondrial genes from other eukaryotes to be in this clade).
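The tree-screening step can be automated in a rough way. The sketch below is a simplified illustration and not the procedure used for the paper. It assumes each tree is already rooted and stored as nested tuples of leaf names, that a hypothetical `group_of` mapping assigns every leaf to a taxonomic group, and that "groups with" means every leaf in the sister clade of the Plasmodium gene belongs to an allowed group. Neighbor-joining trees are unrooted, so a real screen would need a rooting or a bipartition-based test.

```python
def leaves(tree):
    """Yield every leaf label in a nested-tuple tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaves(child)


def sister_leaves(tree, target):
    """Return the leaves that share the target's immediate parent clade,
    or None if the target is not in the tree."""
    if isinstance(tree, str):
        return None
    for i, child in enumerate(tree):
        if child == target:
            return [
                leaf
                for j, other in enumerate(tree)
                if j != i
                for leaf in leaves(other)
            ]
    for child in tree:
        found = sister_leaves(child, target)
        if found is not None:
            return found
    return None


def groups_with(tree, target, group_of, allowed):
    """True if every leaf in the target's sister clade is in an allowed group."""
    sisters = sister_leaves(tree, target)
    return bool(sisters) and all(group_of[leaf] in allowed for leaf in sisters)
```

For the mitochondrial screen the allowed groups would be alpha proteobacteria plus mitochondrial genes from other eukaryotes. A strict "all sisters" rule is only one possible choice, and trees flagged this way still deserve a look by eye.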

To find plastid derived genes I did a similar screen except instead searched for genes that grouped in evolutionary trees with genes from cyanobacteria (or eukaryotic genes that were from plastids). The section of the paper that I helped write is below:

A large number of nuclear-encoded genes in most eukaryotic species trace their evolutionary origins to genes from organelles that have been transferred to the nucleus during the course of eukaryotic evolution. Similarity searches against other complete genomes were used to identify P. falciparum nuclear-encoded genes that may be derived from organellar genomes. Because similarity searches are not an ideal method for inferring evolutionary relatedness (66), phylogenetic analysis was used to gain a more accurate picture of the evolutionary history of these genes. Out of 200 candidates examined, 60 genes were identified as being of probable mitochondrial origin. The proteins encoded by these genes include many with known or expected mitochondrial functions (for example, the tricarboxylic acid (TCA) cycle, protein translation, oxidative damage protection, the synthesis of haem, ubiquinone and pyrimidines), as well as proteins of unknown function. Out of 300 candidates examined, 30 were identified as being of probable plastid origin, including genes with predicted roles in transcription and translation, protein cleavage and degradation, the synthesis of isoprenoids and fatty acids, and those encoding four subunits of the pyruvate dehydrogenase complex. The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content. Nevertheless, it appears likely that the total number of plastid-derived genes in P. falciparum will be significantly lower than that in the plant A. thaliana (estimated to be over 1,000). Phylogenetic analysis reveals that, as with the A. thaliana plastid, many of the genes predicted to be targeted to the apicoplast are apparently not of plastid origin. Of 333 putative apicoplast-targeted genes for which trees were constructed, only 26 could be assigned a probable plastid origin. 
In contrast, 35 were assigned a probable mitochondrial origin and another 85 might be of mitochondrial origin but are probably not of plastid origin (they group with eukaryotes that have not had plastids in their history, such as humans and fungi, but the relationship to mitochondrial ancestors is not clear). The apparent non-plastid origin of these genes could either be due to inaccuracies in the targeting predictions or to the co-option of genes derived from the mitochondria or the nucleus to function in the plastid, as has been shown to occur in some plant species (67).

Thing 4: Analysis of DNA repair genes

Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:

DNA repair processes are involved in maintenance of genomic integrity in response to DNA damaging agents such as irradiation, chemicals and oxygen radicals, as well as errors in DNA metabolism such as misincorporation during DNA replication. The P. falciparum genome encodes at least some components of the major DNA repair processes that have been found in other eukaryotes (111, 112). The core of eukaryotic nucleotide excision repair is present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1) although some highly conserved proteins with more accessory roles could not be found (for example, XPA/Rad4, XPC). The same is true for homologous recombinational repair with core proteins such as MRE11, DMC1, Rad50 and Rad51 present but accessory proteins such as NBS1 and XRS2 not yet found. These accessory proteins tend to be poorly conserved and have not been found outside of animals or yeast, respectively, and thus may be either absent or difficult to identify in P. falciparum. However, it is interesting that Archaea possess many of the core proteins but not the accessory proteins for these repair processes, suggesting that many of the accessory eukaryotic repair proteins evolved after P. falciparum diverged from other eukaryotes.

The presence of MutL and MutS homologues including possible orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that P. falciparum can perform post-replication mismatch repair. Orthologues of MSH4 and MSH5, which are involved in meiotic crossing over in other eukaryotes, are apparently absent in P. falciparum. The repair of at least some damaged bases may be performed by the combined action of the four base excision repair glycosylase homologues and one of the apurinic/apyrimidinic (AP) endonucleases (homologues of Xth and Nfo are present). Experimental evidence suggests that this is done by the long-patch pathway (113).

The presence of a class II photolyase homologue is intriguing, because it is not clear whether P. falciparum is exposed to significant amounts of ultraviolet irradiation during its life cycle. It is possible that this protein functions as a blue-light receptor instead of a photolyase, as do members of this gene family in some organisms such as humans. Perhaps most interesting is the apparent absence of homologues of any of the genes encoding enzymes known to be involved in non-homologous end joining (NHEJ) in eukaryotes (for example, Ku70, Ku86, Ligase IV and XRCC1)(112). NHEJ is involved in the repair of double strand breaks induced by irradiation and chemicals in other eukaryotes (such as yeast and humans), and is also involved in a few cellular processes that create double strand breaks (for example, VDJ recombination in the immune system in humans). The role of NHEJ in repairing radiation-induced double strand breaks varies between species (114). For example, in humans, cells with defects in NHEJ are highly sensitive to γ-irradiation while yeast mutants are not. Double strand breaks in yeast are repaired primarily by homologous recombination. As NHEJ is involved in regulating telomere stability in other organisms, its apparent absence in P. falciparum may explain some of the unusual properties of the telomeres in this species (115).

Back to the story
Anyway … back to the story. I do not have current access to all of TIGR’s old computer systems which is where my searches for the genome paper reside. But I figured I might have some notes somewhere on my computer about what blast parameters I used for these searches. And amazingly I did. As I was getting ready to write back to Malcolm and to the person who had asked for the information I decided to double check to see what was in the paper. And amazingly, much of the detail was right there all along.

Plasmodium falciparum proteins were searched against a database of proteins from all complete genomes as well as from a set of organelle, plasmid and viral genomes. Putative recently duplicated genes were identified as those encoding proteins with better BLASTP matches (based on E value with a 10^-15 cutoff) to other proteins in P. falciparum than to proteins in any other species. Proteins of possible organellar descent were identified as those for which one of the top six prokaryotic matches (based on E value) was to either a protein encoded by an organelle genome or by a species related to the organelle ancestors (members of the Rickettsia subgroup of the α-Proteobacteria or cyanobacteria). Because BLAST matches are not an ideal method of inferring evolutionary history, phylogenetic analysis was conducted for all these proteins. For phylogenetic analysis, all homologues of each protein were identified by BLASTP searches of complete genomes and of a non-redundant protein database. Sequences were aligned using CLUSTALW, and phylogenetic trees were inferred using the neighbour-joining algorithms of CLUSTALW and PHYLIP. For comparative analysis of eukaryotes, the proteomes of all eukaryotes for which complete genomes are available (except the highly reduced E. cuniculi) were searched against each other. The proportion of proteins in each eukaryotic species that had a BLASTP match in each of the other eukaryotic species was determined, and used to infer a ‘whole-genome tree’ using the neighbour-joining algorithm. Possible eukaryotic conserved and specific proteins were identified as those with matches to all the complete eukaryotic genomes (10^-30 E-value cutoff) but without matches to any complete prokaryotic genome (10^-15 cutoff).
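The "top six prokaryotic matches" rule in the methods text is simple enough to sketch. This is an illustration of the rule as quoted, not the original pipeline. The input tuples, the `organelle_like` set of source labels, and the function name are all hypothetical.

```python
def organelle_candidates(hits, organelle_like, top_n=6):
    """Pick proteins for which one of the top_n matches (by E-value) comes
    from an organelle genome or a relative of the organelle ancestors.

    hits: iterable of (query_id, subject_source, evalue), already restricted
          to prokaryotic and organelle-genome subjects (an assumption).
    organelle_like: set of source labels counted as organelle-related.
    Returns {query_id: sorted list of organelle-like sources in the top_n}.
    """
    by_query = {}
    for query, source, evalue in hits:
        by_query.setdefault(query, []).append((evalue, source))

    candidates = {}
    for query, matches in by_query.items():
        top = sorted(matches)[:top_n]  # lowest E-values first
        found = sorted({src for _, src in top if src in organelle_like})
        if found:
            candidates[query] = found
    return candidates
```

Anything this flags is only a candidate. As the methods text says, BLAST matches are not an ideal way to infer evolutionary history, so each candidate still needs a phylogenetic tree.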

Alas, I cannot for the life of me find what other parameters I used for the blastp searches. I am 99.9999% sure I used default settings but alas, I don’t know what default settings for blast were in that era. And I am not even sure which version of blastp was installed on the TIGR computer systems then. I certainly need to do a better job of making sure everything I do is truly reproducible.

Reproducibility

This all brings me to the actual real part of this story. Reproducibility. It is a big deal. Anyone should be able to reproduce what was done in a study. And alas, it is difficult to do that when not all the methods are fully described. And one should also provide intermediate results so that people do not have to redo everything you did in a study but can just reproduce part of it. It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium. Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material. It is probably still lounging around somewhere on the JCVI computer systems (TIGR alas, no longer officially exists … it was swallowed by the J. Craig Venter Institute …). But I will keep digging and I will post them to some place like FigShare if/when I find them.

Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.

The key lesson – plan in advance for how you are going to share results, methods, data, etc …
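One low-effort way to act on that lesson is to write a small "run manifest" next to every set of results, recording the tool, its version, every parameter (including the ones left at default), and checksums of the input files. The sketch below is one possible shape for such a record. The field names and functions are made up for illustration, and the caller is assumed to supply the tool name, version string, and parameters.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tool, version, parameters, input_paths):
    """Collect what someone would need to rerun an analysis step."""
    return {
        "tool": tool,                    # e.g. the search program's name
        "version": version,              # exact version string, as reported
        "parameters": dict(parameters),  # every setting, defaults included
        "inputs": {path: sha256_of(path) for path in input_paths},
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_manifest(manifest, out_path):
    """Save the manifest as readable, diff-friendly JSON."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A record like this would have answered the question about which blastp version and settings were used, and the checksums make it possible to confirm later that a database file is the same one that was searched.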

Interesting new paper: "Proving universal common ancestry with similar sequences"

Just discovered an interesting paper by Leonardo de Oliveira Martins and David Posada. It is titled “Proving universal common ancestry with similar sequences.” It relates to a paper by Douglas Theobald: “A formal test of the theory of universal common ancestry. Nature 2010; 465:219-22.” Although the latter paper is not openly available, the more recent one is.

The new paper is worth a look. Not sure about the Theobald one as I do not have access from home.

Am hoping Leonardo writes more about this in his blog: Bayesian Procedures in Biology ….

Antibiotic use in animals (may) lead to superbugs in people #mBIO

New paper in mBio of potential interest from Lance Price et al.: Staphylococcus aureus CC398: Host Adaptation and Emergence of Methicillin Resistance in Livestock. For those not in the know, mBio is a relatively new Open Access journal from the American Society for Microbiology. The paper discusses genomic studies of MRSA that have led the authors to conclude that antibiotic use in animals may contribute to the rise and spread of superbugs in people.

From here. Maximum-parsimony tree of the 89 CC398 isolates (including ST398SO385) based on 4,238 total SNPs, including 1,102 parsimony-informative SNPs with a CI of 0.9591. Clades and groups of importance are labeled in a hierarchical fashion to facilitate description in the text. The tree was rooted with clade I based on an iterative selection process that identified this group as the most ancestral (see Materials and Methods). COO, country of origin; AT, Austria; BE, Belgium; CA, Canada; CH, Switzerland; CN, China; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GF, French Guiana; HU, Hungary; IT, Italy; NL, The Netherlands; PE, Peru; PL, Poland; PT, Portugal; SI, Slovenia; US, United States; P, pig; H, human; R, horse; T, turkey; B, bovine; MET, methicillin susceptibility; R, resistant; S, susceptible.

The figure above is the only figure in the main text of the paper. There are others in the supplemental information, which seems a bit strange to me – why put anything in supplemental information when the paper is only released online? Or at least have thumbnail images for all figures in the main text …

Anyway, the paper and press release got picked up by many newsy places. See for example:

Pig-to-Human ‘Superbug’ May Be Due to Animal Antibiotics (US News)
MRSA Staph Strain Developed Drug Resistance in Your Burger (US News)
How Using Antibiotics In Animal Feed Creates Superbugs (NPR blog)
MRSA in Livestock May Spread to Humans (ABC news)
Staph Turns into Drug-Resistant Superbug on Farms (SciAm blog)

I note – the Press Release is MUCH better than the last one about a paper by Price, which I wrote about here: The Tree of Life: #PLoSOne paper keywords revealing: (#Penis #Microbiome #Circumcision #HIV); press release misleading … Lance was awesomely quick to respond to my complaints about that PR. The PR for this paper is not so bad (a bit over the top in some of the quotes) but no need for comments I think.

Citation:

Price LB, et al. 2012. Staphylococcus aureus CC398: host adaptation and emergence of methicillin resistance in livestock. mBio 3(1):e00305-11. doi:10.1128/mBio.00305-11.

UPDATE 2/21 5:30 PM: an alternative (and much more pleasing) press release from ASM is here.

Draft blog post cleanup #1: Divide and Conquer to Find Orthologs

OK – I am cleaning out my draft blog post list. I start many posts and don’t finish them and then they sit in the draft section of Blogger. Well, I am going to try to clean some of that up by writing some mini posts. Here is the first:

Saw an interesting paper worth checking out:
PLoS ONE: Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

It not only describes a way to speed up continual ortholog annotation in bacterial and archaeal genomes but is also linked to an ongoing open code development project.

Here is the abstract:

Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. 
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/.
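To get a feel for why the abstract calls all-against-all BLAST searching prohibitively expensive, here is a back-of-the-envelope count in Python. It only counts ordered query-subject pairs and says nothing about how xBASE-Orth itself works. The figure of 4 million sequences is a lower bound taken from the abstract ("more than 4 million coding sequences"), and the function name is mine.

```python
def all_against_all_pairs(n_sequences):
    """Ordered query-subject pairs, excluding each sequence searched against itself."""
    return n_sequences * (n_sequences - 1)


n = 4_000_000  # lower bound from the abstract
print(f"{all_against_all_pairs(n):,}")  # prints 15,999,996,000,000
```

That is roughly 1.6 × 10¹³ comparisons before any shortcuts, which is the scale the pan-genome proxies described in the abstract are meant to cut down.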

Definitely worth checking out.
