A few days ago I got an email from a colleague who I had not seen in many years. It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.
His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum – the causative agent of most human malaria cases – for which he was the lead author. Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper. The paper is freely available at Nature – at least for now – every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.
Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper. I note – as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome. I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.
Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted. And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need. It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.
Thing 1: Conserved eukaryote genes
One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea. I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.
As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome. In the N&V I had written “Genome sequencing: Brouhaha over the other yeast” I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes. With the help of the Nature staff I had also made a figure which demonstrated (sort of) what they were trying to do in their analysis – which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time. As another aside – the S. pombe genome paper and my News and Views article are freely available …
|Figure 1: The tree of life, with the branches labelled according to Wood et al.’s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000.|
Anyway, I did a similar analysis to what was in the S. pombe genome paper and I found a reasonable number and helped write a section for the paper on this.
Comparative genome analysis with other eukaryotes for which the complete genome is available (excluding the parasite E. cuniculi) revealed that, in terms of overall genome content, P. falciparum is slightly more similar to Arabidopsis thaliana than to other taxa. Although this is consistent with phylogenetic studies (64), it could also be due to the presence in the P. falciparum nuclear genome of genes derived from plastids or from the nuclear genome of the secondary endosymbiont. Thus the apparent affinity of Plasmodium and Arabidopsis might not reflect the true phylogenetic history of the P. falciparum lineage. Comparative genomic analysis was also used to identify genes apparently duplicated in the P. falciparum lineage since it split from the lineages represented by the other completed genomes (Supplementary Table B).
There are 237 P. falciparum proteins with strong matches to proteins in all completed eukaryotic genomes but no matches to proteins, even at low stringency, in any complete prokaryotic proteome (Supplementary Table C). These proteins help to define the differences between eukaryotes and prokaryotes. Proteins in this list include those with roles in cytoskeleton construction and maintenance, chromatin packaging and modification, cell cycle regulation, intracellular signalling, transcription, translation, replication, and many proteins of unknown function. This list overlaps with, but is somewhat larger than, the list generated by an analysis of the S. pombe genome (65). The differences are probably due in part to the different stringencies used to identify the presence or absence of homologues in the two studies.
The list of genes is available as supplemental material on the Nature web site. Alas it is in MS Word format which is not the most useful thing. But more on that issue at the end of this post.
Thing 2. Searching for lineage specific duplications
Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available). The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome. The list of genes we identified this way is also provided as a Word document in Supplemental materials.
Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum
A large number of nuclear-encoded genes in most eukaryotic species trace their evolutionary origins to genes from organelles that have been transferred to the nucleus during the course of eukaryotic evolution. Similarity searches against other complete genomes were used to identify P. falciparum nuclear-encoded genes that may be derived from organellar genomes. Because similarity searches are not an ideal method for inferring evolutionary relatedness (66), phylogenetic analysis was used to gain a more accurate picture of the evolutionary history of these genes. Out of 200 candidates examined, 60 genes were identified as being of probable mitochondrial origin. The proteins encoded by these genes include many with known or expected mitochondrial functions (for example, the tricarboxylic acid (TCA) cycle, protein translation, oxidative damage protection, the synthesis of haem, ubiquinone and pyrimidines), as well as proteins of unknown function. Out of 300 candidates examined, 30 were identified as being of probable plastid origin, including genes with predicted roles in transcription and translation, protein cleavage and degradation, the synthesis of isoprenoids and fatty acids, and those encoding four subunits of the pyruvate dehydrogenase complex. The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content. Nevertheless, it appears likely that the total number of plastid-derived genes in P. falciparum will be significantly lower than that in the plant A. thaliana (estimated to be over 1,000). Phylogenetic analysis reveals that, as with the A. thaliana plastid, many of the genes predicted to be targeted to the apicoplast are apparently not of plastid origin. Of 333 putative apicoplast-targeted genes for which trees were constructed, only 26 could be assigned a probable plastid origin. In contrast, 35 were assigned a probable mitochondrial origin and another 85 might be of mitochondrial origin but are probably not of plastid origin (they group with eukaryotes that have not had plastids in their history, such as humans and fungi, but the relationship to mitochondrial ancestors is not clear). The apparent non-plastid origin of these genes could either be due to inaccuracies in the targeting predictions or to the co-option of genes derived from the mitochondria or the nucleus to function in the plastid, as has been shown to occur in some plant species (67).
Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:
DNA repair processes are involved in maintenance of genomic integrity in response to DNA damaging agents such as irradiation, chemicals and oxygen radicals, as well as errors in DNA metabolism such as misincorporation during DNA replication. The P. falciparum genome encodes at least some components of the major DNA repair processes that have been found in other eukaryotes (111, 112). The core of eukaryotic nucleotide excision repair is present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1) although some highly conserved proteins with more accessory roles could not be found (for example, XPA/Rad4, XPC). The same is true for homologous recombinational repair with core proteins such as MRE11, DMC1, Rad50 and Rad51 present but accessory proteins such as NBS1 and XRS2 not yet found. These accessory proteins tend to be poorly conserved and have not been found outside of animals or yeast, respectively, and thus may be either absent or difficult to identify in P. falciparum. However, it is interesting that Archaea possess many of the core proteins but not the accessory proteins for these repair processes, suggesting that many of the accessory eukaryotic repair proteins evolved after P. falciparum diverged from other eukaryotes.
The presence of MutL and MutS homologues including possible orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that P. falciparum can perform post-replication mismatch repair. Orthologues of MSH4 and MSH5, which are involved in meiotic crossing over in other eukaryotes, are apparently absent in P. falciparum. The repair of at least some damaged bases may be performed by the combined action of the four base excision repair glycosylase homologues and one of the apurinic/apyrimidinic (AP) endonucleases (homologues of Xth and Nfo are present). Experimental evidence suggests that this is done by the long-patch pathway (113).
The presence of a class II photolyase homologue is intriguing, because it is not clear whether P. falciparum is exposed to significant amounts of ultraviolet irradiation during its life cycle. It is possible that this protein functions as a blue-light receptor instead of a photolyase, as do members of this gene family in some organisms such as humans. Perhaps most interesting is the apparent absence of homologues of any of the genes encoding enzymes known to be involved in non-homologous end joining (NHEJ) in eukaryotes (for example, Ku70, Ku86, Ligase IV and XRCC1)(112). NHEJ is involved in the repair of double strand breaks induced by irradiation and chemicals in other eukaryotes (such as yeast and humans), and is also involved in a few cellular processes that create double strand breaks (for example, VDJ recombination in the immune system in humans). The role of NHEJ in repairing radiation-induced double strand breaks varies between species (114). For example, in humans, cells with defects in NHEJ are highly sensitive to -irradiation while yeast mutants are not. Double strand breaks in yeast are repaired primarily by homologous recombination. As NHEJ is involved in regulating telomere stability in other organisms, its apparent absence in P. falciparum may explain some of the unusual properties of the telomeres in this species (115).
Back to the story
Anyway … back to the story. I do not have current access to all of TIGR’s old computer systems which is where my searches for the genome paper reside. But I figured I might have some notes somewhere on my computer about what blast parameters I used for these searches. And amazingly I did. As I was getting ready to write back to Malcolm and to the person who has asked for the information I decided to double check to see what was in the paper. And amazingly, much of the detail was right there all along.
Plasmodium falciparum proteins were searched against a database of proteins from all complete genomes as well as from a set of organelle, plasmid and viral genomes. Putative recently duplicated genes were identified as those encoding proteins with better BLASTP matches (based on E value with a 10-15 cutoff) to other proteins in P. falciparum than to proteins in any other species. Proteins of possible organellar descent were identified as those for which one of the top six prokaryotic matches (based on E value) was to either a protein encoded by an organelle genome or by a species related to the organelle ancestors (members of the Rickettsia subgroup of the -Proteobacteria or cyanobacteria). Because BLAST matches are not an ideal method of inferring evolutionary history, phylogenetic analysis was conducted for all these proteins. For phylogenetic analysis, all homologues of each protein were identified by BLASTP searches of complete genomes and of a non-redundant protein database. Sequences were aligned using CLUSTALW, and phylogenetic trees were inferred using the neighbour-joining algorithms of CLUSTALW and PHYLIP. For comparative analysis of eukaryotes, the proteomes of all eukaryotes for which complete genomes are available (except the highly reduced E. cuniculi) were searched against each other. The proportion of proteins in each eukaryotic species that had a BLASTP match in each of the other eukaryotic species was determined, and used to infer a ‘whole-genome tree’ using the neighbour-joining algorithm. Possible eukaryotic conserved and specific proteins were identified as those with matches to all the complete eukaryotic genomes (10-30 E-value cutoff) but without matches to any complete prokaryotic genome (10-15 cutoff).
Alas, I cannot for the life of me find what other parameters I used for the blastp searches. I am 99.9999% sure I used default settings but alas, I don’t know what default settings for blast were in that era. And I am not even sure which version of blastp was installed on the TIGR computer systems then. I certainly need to do a better job of making sure everything I do is truly reproducible.
This all brings me to the actual real part of this story. Reproducibility. It is a big deal. Anyone should be able to reproduce what was done in a study. And alas, it is difficult to do that when not all the methods are fully described. And one should also provide intermediate results so that people to do not have to redo everything you did in a study but can just reproduce part of it. It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium. Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material. It is probably still lounging around somewhere on the JCVI computer systems (TIGR alas, no longer officially exists … it was swallowed by the J. Craig Venter Institute …). But I will keep digging and I will post them to some place like FigShare if/when I find them.
Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.
The key lesson – plan in advance for how you are going to share results, methods, data, etc …
7 thoughts on “A blast from the past: Plasmodium, plastids, phylogeny, and reproducibility”
We are clearly on the same wavelength today :). http://ivory.idyll.org/blog/anecdotal-science.html
ha … we should start a club
Let's call ourselves “scientists”!
This comment has been removed by a blog administrator.
The good folks over at ENCODE opted to make a virtual machine image available for download to coincide with the release of their publications today.
Available at http://scofield.bx.psu.edu/~dannon/encodevm/
How's that for transparency and reproducibility!
cool – have you run it?
On a related note, I just read this paper tonight: “Next-generation sequencing data interpretation: enhancing reproducibility and accessibility” http://www.ncbi.nlm.nih.gov/pubmed/22898652 (not open access – frustrating because its a good review). Take home points were 1) Life scientists can't escape computational biology now given the NGS revolution, so they're going to have to get used to it 2) Always provide access to primary data, 3) Record versions of ALL auxiliary datasets used during analysis (e.g. genome assembly releases), 4) Note the EXACT versions of software used 5) Record all parameters, even if defaults are used, 6) Provide all custom scripts as supplementary data.
The authors also made the point that it is very painful right now for researchers to even adhere to these basic guidelines, because most bioinformatics software is just not designed to automatically log versions/parameters/commands and blatantly deliver them to users as output files. So this paper was really an eye opener for me about just how *important* it really is to record and publish EVERYTHING related to original data analysis, for reproducibility purposes. I'm all for adopting that kind of policy!