New paper from some in the Eisen lab: phylogeny driven sequencing of cyanobacteria

(Cross post from my lab blog)

Quick post here.  This paper came out a few months ago but it was not freely available so I did not write about it (it is in PNAS but was not published with the PNAS Open Option — not my choice – lead author did not choose that option and I was not really in the loop when that choice was made).

Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. [Proc Natl Acad Sci U S A. 2013] – PubMed – NCBI.
Anyway – it is now in PubMed Central and at least freely available so I felt OK posting about it now.  It is in a way a follow-up to the “A phylogeny driven genomic encyclopedia of bacteria and archaea” paper (AKA GEBA) from 2009, with this paper zooming in on the cyanobacteria.

A blast from the past: Plasmodium, plastids, phylogeny, and reproducibility

A few days ago I got an email from a colleague who I had not seen in many years.  It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.

His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum the causative agent of most human malaria cases –  for which he was the lead author.   Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper.   The paper is freely available at Nature – at least for now – every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.

Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper.  I note – as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome.  I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.

Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted.  And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need.  It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.  

As I was communicating with this person, I started digging through my files and my brain trying to remember exactly what had been done for this paper more than 10 years ago.  I remember Malcolm and others from the Plasmodium community organizing some “jamborees” looking at the annotation of the genome. At one of those jamborees I met with Malcolm and some of the folks from the Sanger Center (which was one of the big players in the P. falciparum genome sequencing), and after some discussion I ended up doing four main things relating to the paper, which I describe below.

Thing 1: Conserved eukaryote genes

One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea.  I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.

As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome.  In the N&V I had written “Genome sequencing: Brouhaha over the other yeast” I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes.  With the help of the Nature staff I had also made a figure which demonstrated (sort of) what they were trying to do in their analysis – which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time.  As another aside – the S. pombe genome paper and my News and Views article are freely available …

Figure 1: The tree of life, with the branches labelled according to Wood et al.’s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000

Anyway, I did a similar analysis to what was in the S. pombe genome paper and I found a reasonable number and helped write a section for the paper on this.

Comparative genome analysis with other eukaryotes for which the complete genome is available (excluding the parasite E. cuniculi) revealed that, in terms of overall genome content, P. falciparum is slightly more similar to Arabidopsis thaliana than to other taxa. Although this is consistent with phylogenetic studies (64), it could also be due to the presence in the P. falciparum nuclear genome of genes derived from plastids or from the nuclear genome of the secondary endosymbiont. Thus the apparent affinity of Plasmodium and Arabidopsis might not reflect the true phylogenetic history of the P. falciparum lineage. Comparative genomic analysis was also used to identify genes apparently duplicated in the P. falciparum lineage since it split from the lineages represented by the other completed genomes (Supplementary Table B). 

There are 237 P. falciparum proteins with strong matches to proteins in all completed eukaryotic genomes but no matches to proteins, even at low stringency, in any complete prokaryotic proteome (Supplementary Table C). These proteins help to define the differences between eukaryotes and prokaryotes. Proteins in this list include those with roles in cytoskeleton construction and maintenance, chromatin packaging and modification, cell cycle regulation, intracellular signalling, transcription, translation, replication, and many proteins of unknown function. This list overlaps with, but is somewhat larger than, the list generated by an analysis of the S. pombe genome (65). The differences are probably due in part to the different stringencies used to identify the presence or absence of homologues in the two studies.

The list of genes is available as supplemental material on the Nature web site.  Alas it is in MS Word format which is not the most useful thing.  But more on that issue at the end of this post.

Thing 2: Searching for lineage specific duplications

Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available).  The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome.  The list of genes we identified this way is also provided as a Word document in Supplemental materials.
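The quick and dirty screen described above can be sketched in a few lines of Python. This is only an illustration, not the code we ran at TIGR: the tuple layout of the hits, the function name, and the toy gene and species names are all assumptions, and the 1e-15 E-value cutoff is taken from the methods text of the paper quoted further down in this post.

```python
from math import inf

def lineage_specific_duplicates(hits, own_species, evalue_cutoff=1e-15):
    """Return queries whose best hit within their own genome beats every hit elsewhere.

    hits: iterable of (query, subject, subject_species, evalue) tuples
          (a hypothetical layout, e.g. parsed from tabular BLASTP output).
    """
    best_own = {}    # best E-value to another protein in the same genome
    best_other = {}  # best E-value to a protein in any other genome
    for query, subject, species, evalue in hits:
        if query == subject or evalue > evalue_cutoff:
            continue  # skip self-hits and hits weaker than the cutoff
        table = best_own if species == own_species else best_other
        table[query] = min(evalue, table.get(query, inf))
    return sorted(q for q, e in best_own.items() if e < best_other.get(q, inf))

# Toy data: pf1 looks duplicated within the lineage, pf3 does not,
# and pf5 only has a weak hit that fails the cutoff.
toy_hits = [
    ("pf1", "pf1", "Pf", 0.0),
    ("pf1", "pf2", "Pf", 1e-50),
    ("pf1", "at1", "At", 1e-40),
    ("pf3", "sc1", "Sc", 1e-60),
    ("pf3", "pf4", "Pf", 1e-20),
    ("pf5", "pf6", "Pf", 1e-5),
]
candidates = lineage_specific_duplicates(toy_hits, own_species="Pf")
```

A screen like this only nominates candidates; as with the organelle analysis below, trees are a better way to confirm what the blast scores suggest.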

Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum

The third thing I did for the paper was to search for organelle derived genes in the nuclear genome of Plasmodium.  Specifically I was looking for genes derived from the mitochondrial genome and plastid genome.  For those who do not know, Plasmodium is a member of the Apicomplexa – all organisms in this group have an unusual organelle called the apicoplast.  Though the exact nature of this organelle had been debated, its evolutionary origins were determined by none other than Malcolm Gardner many years earlier (Gardner et al. 1994). They had shown that this organelle was in fact derived from chloroplasts (which themselves are derived from cyanobacteria).  I am ashamed to say that before hanging out with Malcolm and talking about Plasmodium I did not know this.  This finding of a chloroplast in an evolutionary group of eukaryotes that are not particularly closely related to plants is one of the key pieces of evidence in the “secondary endosymbiosis” hypothesis which proposes that some eukaryotes have brought into themselves as an endosymbiont a single-celled photosynthetic alga which had a chloroplast.  
Anyway – here we were – with the first full genome of a member of the Apicomplexa.  And we could use it to discover some new details on plastid evolution and secondary endosymbioses.  So I adapted some methods I had used in analyzing the Arabidopsis genome (see Lin et al. 1999 and AGI 2000), and searched for plastid derived genes in the nuclear genome of Plasmodium.  Why look in the nuclear genome for plastid genes?  Or mitochondrial genes for that matter.  Well, it turns out that genes that were once in the organelle genomes frequently move to the nuclear genome of their “host”.  In fact, a lot of genes move.  So – if you want to study the evolution of an organism’s organelles, it is sometimes more fruitful to look in the nuclear genome than in the actual organelle’s genome.  OK – now back to the Plasmodium genome.  What I was doing was trying to find genes in the nuclear genome that had once been in the plastid genome.  How would you look for these?  
To find mitochondrial-derived genes I did blast searches against the same database of genomes used to study the evolution of eukaryotes, but for this I looked for genes in Plasmodium that had decent matches to genes in alpha proteobacteria.  For those I then built phylogenetic trees of each gene and its homologs, and screened through all the trees to look for any in which the gene from Plasmodium grouped inside a clade with sequences from alpha proteobacteria (allowing for mitochondrial genes from other eukaryotes to be in this clade).  
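For the curious, the tree screening step can be sketched roughly as below. This is a simplified illustration and not what I actually ran: it assumes rooted trees written by hand as nested tuples, made-up leaf names and group labels, and a strict rule that every leaf in the smallest clade around the Plasmodium gene must come from an allowed group. The real screen involved looking at a lot of trees.

```python
def leaves(node):
    """List the leaf names under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found

def sister_leaves(tree, query):
    """Leaves of the smallest clade containing `query`, excluding `query` itself.

    Returns None if `query` is not in the tree.
    """
    if isinstance(tree, str):
        return None
    for child in tree:
        if child == query:
            return [leaf for leaf in leaves(tree) if leaf != query]
        result = sister_leaves(child, query)
        if result is not None:
            return result
    return None

def groups_with(tree, query, group_of, allowed_groups):
    """True if everything in the smallest clade around `query` is from an allowed group."""
    sisters = sister_leaves(tree, query)
    return bool(sisters) and all(group_of[leaf] in allowed_groups for leaf in sisters)

# Hypothetical labels for the toy trees below.
group_of = {
    "rickettsia": "alpha_proteobacteria",
    "yeast_mito": "mitochondrion",
    "ecoli": "other_bacteria",
    "bacillus": "other_bacteria",
}
mito_like = {"alpha_proteobacteria", "mitochondrion"}

tree_a = (("pf_gene", ("rickettsia", "yeast_mito")), ("ecoli", "bacillus"))
tree_b = (("pf_gene", "ecoli"), "rickettsia")

looks_mitochondrial_a = groups_with(tree_a, "pf_gene", group_of, mito_like)
looks_mitochondrial_b = groups_with(tree_b, "pf_gene", group_of, mito_like)
```

Swapping the allowed groups for cyanobacteria and plastid-encoded genes gives the analogous plastid screen.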
To find plastid derived genes I did a similar screen except instead searched for genes that grouped in evolutionary trees with genes from cyanobacteria (or eukaryotic genes that were from plastids).  The section of the paper that I helped write is below:

A large number of nuclear-encoded genes in most eukaryotic species trace their evolutionary origins to genes from organelles that have been transferred to the nucleus during the course of eukaryotic evolution. Similarity searches against other complete genomes were used to identify P. falciparum nuclear-encoded genes that may be derived from organellar genomes. Because similarity searches are not an ideal method for inferring evolutionary relatedness (66), phylogenetic analysis was used to gain a more accurate picture of the evolutionary history of these genes. Out of 200 candidates examined, 60 genes were identified as being of probable mitochondrial origin. The proteins encoded by these genes include many with known or expected mitochondrial functions (for example, the tricarboxylic acid (TCA) cycle, protein translation, oxidative damage protection, the synthesis of haem, ubiquinone and pyrimidines), as well as proteins of unknown function. Out of 300 candidates examined, 30 were identified as being of probable plastid origin, including genes with predicted roles in transcription and translation, protein cleavage and degradation, the synthesis of isoprenoids and fatty acids, and those encoding four subunits of the pyruvate dehydrogenase complex. The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content. Nevertheless, it appears likely that the total number of plastid-derived genes in P. falciparum will be significantly lower than that in the plant A. thaliana (estimated to be over 1,000). Phylogenetic analysis reveals that, as with the A. thaliana plastid, many of the genes predicted to be targeted to the apicoplast are apparently not of plastid origin. Of 333 putative apicoplast-targeted genes for which trees were constructed, only 26 could be assigned a probable plastid origin. 
In contrast, 35 were assigned a probable mitochondrial origin and another 85 might be of mitochondrial origin but are probably not of plastid origin (they group with eukaryotes that have not had plastids in their history, such as humans and fungi, but the relationship to mitochondrial ancestors is not clear). The apparent non-plastid origin of these genes could either be due to inaccuracies in the targeting predictions or to the co-option of genes derived from the mitochondria or the nucleus to function in the plastid, as has been shown to occur in some plant species (67).

Thing 4: Analysis of DNA repair genes 

Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:

DNA repair processes are involved in maintenance of genomic integrity in response to DNA damaging agents such as irradiation, chemicals and oxygen radicals, as well as errors in DNA metabolism such as misincorporation during DNA replication. The P. falciparum genome encodes at least some components of the major DNA repair processes that have been found in other eukaryotes (111, 112). The core of eukaryotic nucleotide excision repair is present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1) although some highly conserved proteins with more accessory roles could not be found (for example, XPA/Rad4, XPC). The same is true for homologous recombinational repair with core proteins such as MRE11, DMC1, Rad50 and Rad51 present but accessory proteins such as NBS1 and XRS2 not yet found. These accessory proteins tend to be poorly conserved and have not been found outside of animals or yeast, respectively, and thus may be either absent or difficult to identify in P. falciparum. However, it is interesting that Archaea possess many of the core proteins but not the accessory proteins for these repair processes, suggesting that many of the accessory eukaryotic repair proteins evolved after P. falciparum diverged from other eukaryotes. 

The presence of MutL and MutS homologues including possible orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that P. falciparum can perform post-replication mismatch repair. Orthologues of MSH4 and MSH5, which are involved in meiotic crossing over in other eukaryotes, are apparently absent in P. falciparum. The repair of at least some damaged bases may be performed by the combined action of the four base excision repair glycosylase homologues and one of the apurinic/apyrimidinic (AP) endonucleases (homologues of Xth and Nfo are present). Experimental evidence suggests that this is done by the long-patch pathway (113). 

The presence of a class II photolyase homologue is intriguing, because it is not clear whether P. falciparum is exposed to significant amounts of ultraviolet irradiation during its life cycle. It is possible that this protein functions as a blue-light receptor instead of a photolyase, as do members of this gene family in some organisms such as humans. Perhaps most interesting is the apparent absence of homologues of any of the genes encoding enzymes known to be involved in non-homologous end joining (NHEJ) in eukaryotes (for example, Ku70, Ku86, Ligase IV and XRCC1)(112). NHEJ is involved in the repair of double strand breaks induced by irradiation and chemicals in other eukaryotes (such as yeast and humans), and is also involved in a few cellular processes that create double strand breaks (for example, VDJ recombination in the immune system in humans). The role of NHEJ in repairing radiation-induced double strand breaks varies between species (114). For example, in humans, cells with defects in NHEJ are highly sensitive to γ-irradiation while yeast mutants are not. Double strand breaks in yeast are repaired primarily by homologous recombination. As NHEJ is involved in regulating telomere stability in other organisms, its apparent absence in P. falciparum may explain some of the unusual properties of the telomeres in this species (115).

Back to the story
Anyway … back to the story.  I do not have current access to all of TIGR’s old computer systems which is where my searches for the genome paper reside.  But I figured I might have some notes somewhere on my computer about what blast parameters I used for these searches.  And amazingly I did.  As I was getting ready to write back to Malcolm and to the person who had asked for the information I decided to double check to see what was in the paper.  And amazingly, much of the detail was right there all along.   

Plasmodium falciparum proteins were searched against a database of proteins from all complete genomes as well as from a set of organelle, plasmid and viral genomes. Putative recently duplicated genes were identified as those encoding proteins with better BLASTP matches (based on E value with a 10⁻¹⁵ cutoff) to other proteins in P. falciparum than to proteins in any other species. Proteins of possible organellar descent were identified as those for which one of the top six prokaryotic matches (based on E value) was to either a protein encoded by an organelle genome or by a species related to the organelle ancestors (members of the Rickettsia subgroup of the α-Proteobacteria or cyanobacteria). Because BLAST matches are not an ideal method of inferring evolutionary history, phylogenetic analysis was conducted for all these proteins. For phylogenetic analysis, all homologues of each protein were identified by BLASTP searches of complete genomes and of a non-redundant protein database. Sequences were aligned using CLUSTALW, and phylogenetic trees were inferred using the neighbour-joining algorithms of CLUSTALW and PHYLIP. For comparative analysis of eukaryotes, the proteomes of all eukaryotes for which complete genomes are available (except the highly reduced E. cuniculi) were searched against each other. The proportion of proteins in each eukaryotic species that had a BLASTP match in each of the other eukaryotic species was determined, and used to infer a ‘whole-genome tree’ using the neighbour-joining algorithm. Possible eukaryotic conserved and specific proteins were identified as those with matches to all the complete eukaryotic genomes (10⁻³⁰ E-value cutoff) but without matches to any complete prokaryotic genome (10⁻¹⁵ cutoff).
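To make the last sentence of that methods text concrete, here is a small sketch of the "in all eukaryotes, in no prokaryote" filter. The two cutoffs come from the text above; everything else (the dictionary layout, the function name, the genome and protein names) is made up for illustration and is not the original pipeline.

```python
from math import inf

def eukaryote_specific(best_evalue, eukaryotes, prokaryotes,
                       euk_cutoff=1e-30, prok_cutoff=1e-15):
    """Pick proteins with a strong match in every eukaryote and no match in any prokaryote.

    best_evalue: {protein: {genome: best BLASTP E-value against that genome}}
                 (hypothetical layout; a missing genome means no hit was reported).
    """
    selected = []
    for protein, by_genome in best_evalue.items():
        in_all_euks = all(by_genome.get(g, inf) <= euk_cutoff for g in eukaryotes)
        in_any_prok = any(by_genome.get(g, inf) <= prok_cutoff for g in prokaryotes)
        if in_all_euks and not in_any_prok:
            selected.append(protein)
    return sorted(selected)

toy_table = {
    "pf_tubulin": {"yeast": 1e-120, "human": 1e-150, "ecoli": 1e-3},
    "pf_ef_tu":   {"yeast": 1e-90,  "human": 1e-95,  "ecoli": 1e-80},
    "pf_orphan":  {"yeast": 1e-10},
}
conserved = eukaryote_specific(toy_table, ["yeast", "human"], ["ecoli"])
```

Note that the prokaryote cutoff is deliberately the looser of the two, so that even a fairly weak prokaryotic match is enough to knock a protein off the list.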

Alas, I cannot for the life of me find what other parameters I used for the blastp searches.  I am 99.9999% sure I used default settings but alas, I don’t know what default settings for blast were in that era.  And I am not even sure which version of blastp was installed on the TIGR computer systems then.  I certainly need to do a better job of making sure everything I do is truly reproducible.


This all brings me to the actual real part of this story.  Reproducibility.  It is a big deal.  Anyone should be able to reproduce what was done in a study.  And alas, it is difficult to do that when not all the methods are fully described.  And one should also provide intermediate results so that people do not have to redo everything you did in a study but can just reproduce part of it.   It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium.  Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material.  It is probably still lounging around somewhere on the JCVI computer systems (TIGR alas, no longer officially exists … it was swallowed by the J. Craig Venter Institute …).  But I will keep digging and I will post them to some place like FigShare if/when I find them.

Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.

The key lesson – plan in advance for how you are going to share results, methods, data, etc …
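One low-tech way to act on that lesson is to write a small record next to every analysis run that captures the tool, its version, the parameters, and checksums of the inputs. The sketch below is just one possible shape for such a record. The field names, the function names and the example values are all my own assumptions, and the version string is something you would have to copy in from the tool itself.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def sha256_of(path):
    """Checksum a file so the exact input can be recognized later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_run_record(tool, version, parameters, input_paths):
    """Collect what someone would need to know to repeat this run."""
    return {
        "tool": tool,
        "version": version,          # copied by hand from the tool's own version output
        "parameters": parameters,    # every setting, including ones left at default
        "inputs": {path: sha256_of(path) for path in input_paths},
        "python": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def save_run_record(record, path):
    """Write the record as readable JSON alongside the results."""
    with open(path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

Had I saved something like `build_run_record("blastp", "<version>", {"evalue": 1e-15}, ["proteins.fasta"])` back then (the key name and file name here are placeholders), answering that email would have taken about a minute.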

New openaccess paper from my lab on "Zorro" software for automated masking of sequence alignments

A new Open Access paper from my lab was just published in PLoS One: Accounting For Alignment Uncertainty in Phylogenomics. Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288

The paper describes the software “Zorro” which is used for automated “masking” of sequence alignments.  Basically, if you have a multiple sequence alignment you would like to use to infer a phylogenetic tree, in some cases it is desirable to block out regions of the alignment that are not reliable.  This blocking is called “masking.”

Masking is thought by many to be important because sequence alignments are in essence a hypothesis about the common ancestry of specific residues in different genes/proteins/regions of the genome.  This “positional homology” is not always easy to assign and for regions where positional homology is ambiguous it may be better to ignore such regions when inferring phylogenetic trees from alignments.

Historically, masking has been done by hand/eye looking for columns in a multiple sequence alignment that seem to have issues and then either eliminating those columns or giving them a lower weight and using a weighting scheme in the phylogenetic analysis.

What Zorro does is it removes much of the subjectivity of this process and generates automated masking patterns for sequence alignments.  It does this by assigning confidence scores to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines.
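To give a feel for how per-column confidence scores get used downstream, here is a toy sketch of hard masking, where columns scoring below a threshold are simply dropped. To be clear, this is not Zorro's code or its file format: the scores, the threshold and the function name are invented for the example, and a real pipeline could instead pass the scores along as column weights.

```python
def mask_alignment(alignment, column_scores, min_score):
    """Drop alignment columns whose confidence score is below min_score.

    alignment: {sequence name: aligned sequence}, all the same length.
    column_scores: one confidence score per alignment column (assumed input).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}

toy_alignment = {
    "seq_a": "MK-LV",
    "seq_b": "MKQLV",
    "seq_c": "MR-IV",
}
toy_scores = [9, 8, 1, 7, 9]  # the gappy third column gets a low score
masked = mask_alignment(toy_alignment, toy_scores, min_score=5)
```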

The software is available at Sourceforge: ZORRO – probabilistic masking for phylogenetics.  It was written primarily by Martin Wu (who is now a Professor at the University of Virginia) and Sourav Chatterji with a little help here and there from Aaron Darling I think.  The development of Zorro was part of my “iSEEM” project that was supported by the Gordon and Betty Moore Foundation.

In the interest of sharing, since the paper is fully open access, I am posting it here below the fold. UPDATE 2/9 – decided to remove this since it got in the way of getting to the comments …

Evidence for symmetric chromosomal inversions around the replication origin in bacteria

I am posting here my first Open Access article, from Genome Biology in 2000.



Evidence for symmetric chromosomal inversions around the replication origin in bacteria
Jonathan A Eisen , John F Heidelberg, Owen White and Steven L Salzberg

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA

Genome Biology 2000, 1:research0011.1-0011.9 doi:10.1186/gb-2000-1-6-research0011

Subject areas: Genome studies, Microbiology and parasitology, Evolution

The electronic version of this article is the complete one and can be found online at:


7 August 2000

Revisions received

25 September 2000


19 October 2000


4 December 2000

© 2000









Whole-genome comparisons can provide great insight into many aspects of biology. Until recently, however, comparisons were mainly possible only between distantly related species. Complete genome sequences are now becoming available from multiple sets of closely related strains or species.


By comparing the recently completed genome sequences of Vibrio cholerae, Streptococcus pneumoniae and Mycobacterium tuberculosis to those of closely related species – Escherichia coli, Streptococcus pyogenes and Mycobacterium leprae, respectively – we have identified an unusual and previously unobserved feature of bacterial genome structure. Scatterplots of the conserved sequences (both DNA and protein) between each pair of species produce a distinct X-shaped pattern, which we call an X-alignment. The key feature of these alignments is that they have symmetry around the replication origin and terminus; that is, the distance of a particular conserved feature (DNA or protein) from the replication origin (or terminus) is conserved between closely related pairs of species. Statistically significant X-alignments are also found within some genomes, indicating that there is symmetry about the replication origin for paralogous features as well.


The most likely mechanism of generation of X-alignments involves large chromosomal inversions that reverse the genomic sequence symmetrically around the origin of replication. The finding of these X-alignments between many pairs of species suggests that chromosomal inversions around the origin are a common feature of bacterial genome evolution.








Large-scale genomic rearrangements and duplications are important in the evolution of species. Previously, these large-scale genome-changing events were studied through genetic or cytological studies. With the availability of many complete genome sequences it is now possible to study such events through comparative genomics. The publication of the yeast genome has led to much better insight into the duplication events that have occurred in fungal and eukaryotic evolution (for example, see [1]). Large chromosomal duplications have also been found from analysis of completed chromosomes of Arabidopsis thaliana [2,3]. The ability to detect large-scale genomic changes is dependent in large part on which genomes are available. Such studies in bacteria, for example, have been limited by the availability of genomes only from distantly related sets of species. Recently, however, the genomes of sets of closely related bacterial species have become available. We have compared these closely related bacterial genomes and have discovered an unusual phenomenon – alignments of whole genomes that show an X-shaped pattern (which we refer to as X-alignments). Here we present the evidence for these X-alignments and discuss mechanisms that might have produced them.


Results and discussion







Figure 1: Between-species whole-genome DNA alignments
Figure 2: Whole-genome proteome alignments
Figure 3: Within-genome DNA alignments
Figure 4: Schematic model of genome inversions

Table 1: Whole-genome DNA alignments using MUMmer
Table 2: Whole-genome protein-level comparisons

Whole-genome X-alignments between species at the DNA level

We compared the DNA sequences of the two chromosomes of Vibrio cholerae [4] with the sequence of the Escherichia coli chromosome [5] using a suffix tree alignment algorithm [6]. The analysis revealed a significant alignment at the DNA level between the V. cholerae large chromosome (chrI) [4] and the E. coli chromosome [5] spanning the entire length of these chromosomes (Figure 1a). Analysis of the reverse complement of V. cholerae chrI with E. coli also produced a significant alignment (Figure 1b). When superimposed, the two alignments produce a clear ‘X’ shape (Figure 1c) that is symmetric about the origin of replication of both genomes. This symmetry indicates that matching sequences tend to occur at the same distance from the origin but not necessarily on the same side of the origin. The X-alignment between V. cholerae and E. coli was found to be statistically significant using a test based on the number of matches found in diagonal strips in the alignment (see the Materials and methods section). Specifically, when V. cholerae chrI is aligned in the forward direction against E. coli, there are 459 maximal unique matching subsequences (MUMs; see the Materials and methods section), of which 177 occurred in a diagonal strip covering 10% of the total area (compared to the expected value of 46). The probability of observing this high a number of MUMs by chance is 4.7 × 10⁻⁵⁹. The alignment of V. cholerae chrI in the reverse direction against E. coli (which corresponds to the MUMs on the anti-diagonal) has a probability of 1.8 × 10⁻⁹⁰. As a control, we compared the genomes of distantly related species, such as E. coli and Mycobacterium tuberculosis. These do not show a significant X-alignment (Table 1).
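The arithmetic behind the diagonal-strip numbers can be illustrated with a short calculation. The sketch below assumes a simple binomial model in which each of the 459 MUMs falls in the strip independently with probability 0.10 (the strip's share of the area). That model is an assumption made here for illustration; the test actually used is the one described in the Materials and methods section, and the function names are hypothetical.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def binom_upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(k, n, p) for k in range(k_min, n + 1)]
    largest = max(logs)
    return exp(largest) * sum(exp(value - largest) for value in logs)

n_mums = 459          # MUMs in the forward alignment
strip_fraction = 0.10 # share of the plot area covered by the diagonal strip
observed = 177        # MUMs that fell inside the strip

expected_in_strip = n_mums * strip_fraction   # about 46, as quoted above
tail_probability = binom_upper_tail(observed, n_mums, strip_fraction)
```

Under this model the expected count is 459 × 0.10 ≈ 46, and observing 177 or more in the strip is vanishingly unlikely by chance.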

We have found that X-alignments of whole genomes are not limited to the V. cholerae versus E. coli comparison. For example, a whole-genome comparison of two bacteria in the genus Streptococcus – S. pyogenes [7] and S. pneumoniae (H. Tettelin, personal communication) – reveals a global X-alignment similar to that of V. cholerae versus E. coli (Figure 1d) which is also statistically significant (Table 1). In addition, an X-alignment is found between two species in the genus Mycobacterium – M. tuberculosis [8] and M. leprae [9] (Figure 1e) – as well as between two strains of Helicobacter pylori (data not shown). The X-alignments observed between any two pairs of genomes are not identical in every aspect. For example, in the alignment between the two Mycobacterium species, each conserved region is much longer than in the other genome pairs. We believe this is due to different numbers of evolutionary events between the species (see below). Whole-genome X-alignments were not found between any other pairs of species, although a related pattern was seen between some of the chlamydial species (see below).

Whole-genome X-alignments between species are also found at the proteome level

To test whether the X-alignments found in the DNA analysis could also be found at the level of whole proteomes, we conducted comparisons of homologous proteins between species (see the Materials and methods section). Figure 2a shows a scatterplot of chromosome positions of all proteins homologous between V. cholerae chrI and E. coli. The presence of many large gene families causes a great deal of noise in this comparison. This noise can be reduced by considering only the best matching homolog for each open reading frame (ORF), rather than all protein homologs (Figure 2b). This filtered protein comparison results in an X-alignment that is statistically significant (Table 2).

Whole-genome X-alignments within species

The finding of the X-alignment pattern between species led us to search for similar patterns within species; that is, global alignments of a genome with its own reverse complement. Of the genomes for which we found between-species X-alignments (M. tuberculosis, M. leprae, S. pyogenes, S. pneumoniae, E. coli and V. cholerae), statistically significant self-alignments are detected for all except M. tuberculosis (Figure 3; probabilities shown in Table 1). Interestingly, these self-alignments are not as strong as those between species. Proteome analysis also shows an X-alignment within species (shown for V. cholerae chrI in Figure 2d; probabilities shown in Table 2). The X-alignment of proteins within V. cholerae chrI is statistically significant only for recently duplicated genes, and disappears when all paralogs are included. The importance of filtering for recent duplications is discussed below.

Model I: whole-genome inverted duplications

One possible explanation for an X-alignment within and between species is an ancestral inverted duplication of the whole genome, as has been suggested for E. coli [10]. The weak or missing X-alignment within species could be explained by gene loss of one of the two duplicates of many of the pairs of genes in the different lineages. Gene loss has been found to follow large chromosomal or genome duplications [11,12,13]. This gene loss is thought to stabilize large duplications by preventing recombination events between duplicate genes. If gene loss is responsible for the weak X-alignment within species, then to maintain the X-alignments between species, the member of the gene pair lost in a particular lineage should be essentially random. If an ancient inverted duplication followed by differential gene loss is the correct explanation for the observed X-alignments, one would expect the genes along one diagonal to be orthologous between species (related to each other by the speciation event), while the genes along the other diagonal should be paralogous (related to each other by the genome duplication event before the speciation of the two lineages). However, the evidence appears to contradict this model: likely orthologous gene pairs are equally distributed on each diagonal (data not shown).

Model II: chromosomal inversions about the origin and/or terminus

A second possible explanation for the X-alignments is that an underlying mechanism allows sections of DNA to move within the genome but maintains the distance of these sections from the origin and/or terminus. There are a variety of possible mechanisms for such movement, but we believe the most likely explanation is the occurrence of large chromosomal inversions that pivot around the replication origin and/or terminus. Large chromosomal inversions, including those that occur around the replication origin and terminus, have been shown to occur in E. coli and Salmonella typhimurium in the laboratory (see, for example, [14,15,16,17,18]). The occurrence of such inversions over evolutionary time scales was first suggested by comparative analysis of the complete genomes of four strains in the genus Chlamydia [19]. In that study, we found that the major chromosomal differences between C. pneumoniae and C. trachomatis (shown in Figure 2c) were consistent with the occurrence of large inversions that pivoted around the origin and terminus (including multiple inversions of different sizes). In Figure 4 we present a hypothetical model showing how a small number of inversions centered around the origin or terminus could produce patterns very similar to those seen in the Chlamydia, Mycobacterium and Helicobacter comparisons. The continued occurrence of such inversions over longer time scales would result in an X-alignment similar to that seen in the V. cholerae versus E. coli and S. pneumoniae versus S. pyogenes comparisons. Thus the different between-species X-alignments could be the result of different numbers of inversions between particular pairs of species.
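The inversion model can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions that are not from the original analysis: the genome is a list of gene labels, the origin is taken to lie between the last and first list elements, and every inversion reaches equally far on both sides of the origin. The function names (`symmetric_inversion`, `simulate`, `on_x_pattern`) are hypothetical.

```python
import random

def symmetric_inversion(genome, half_width):
    """Invert the segment reaching half_width genes to each side of the origin.

    Sketch assumption: the origin sits between the last and first list elements.
    """
    n = len(genome)
    if not 0 < half_width <= n // 2:
        raise ValueError("half_width must be between 1 and len(genome) // 2")
    # The segment straddling the origin, read left to right, then reversed.
    segment = (genome[n - half_width:] + genome[:half_width])[::-1]
    return segment[half_width:] + genome[half_width:n - half_width] + segment[:half_width]

def simulate(n_genes=20, n_inversions=5, seed=0):
    """Apply several origin-centred inversions of random size to genes 0..n-1."""
    rng = random.Random(seed)
    genome = list(range(n_genes))
    for _ in range(n_inversions):
        genome = symmetric_inversion(genome, rng.randint(1, n_genes // 2))
    return genome

def on_x_pattern(genome):
    """True if every gene is at its ancestral index or at the mirrored index."""
    n = len(genome)
    return all(pos in (gene, n - 1 - gene) for pos, gene in enumerate(genome))

evolved = simulate()
print(evolved, on_x_pattern(evolved))
```

Because each gene label equals its ancestral position, plotting ancestral position against current position would place every gene on either the diagonal or the anti-diagonal, which is the X shape described above. Real genomes add noise from other rearrangements, gene gain and gene loss, none of which this sketch models.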

Inversions about the origin and terminus could also produce an X-alignment within species, through the splitting of tandemly duplicated sequence. Many sets of tandemly duplicated genes are found in most bacterial genomes [19,20] (also see Figure 3a,c). As tandem duplications are inherently unstable (one of the duplicates can be rapidly eliminated by slippage and/or recombination events [21]), the fact that many tandem pairs are present within each genome suggests that tandem duplications occur frequently. Thus, it is reasonable to assume that occasionally a large inversion will split a pair of tandemly duplicated genes. An inversion that pivots about the origin and also splits a tandem duplication will result in a pair of paralogous genes spaced symmetrically on opposite sides of the origin.
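A small worked example of the splitting step, again as a hedged sketch: the gene names, the list representation and the helper names (`invert_about_origin`, `signed_offset`) are illustrative assumptions, with the origin placed between the last and first list elements.

```python
def invert_about_origin(genome, half_width):
    """Reverse the segment covering half_width genes on each side of the origin."""
    n = len(genome)
    segment = (genome[n - half_width:] + genome[:half_width])[::-1]
    return segment[half_width:] + genome[half_width:n - half_width] + segment[:half_width]

def signed_offset(genome, gene):
    """Offset of a gene's midpoint from the origin: positive on one arm, negative on the other."""
    n = len(genome)
    i = genome.index(gene)
    return i + 0.5 if i < n / 2 else i + 0.5 - n

# dupA and dupB are a hypothetical tandem pair on the same replication arm.
before = ["g0", "g1", "dupA", "dupB", "g4", "g5", "g6", "g7", "g8", "g9"]
# The inversion endpoint falls exactly between dupA and dupB.
after = invert_about_origin(before, 3)

print(signed_offset(before, "dupA"), signed_offset(before, "dupB"))
print(signed_offset(after, "dupA"), signed_offset(after, "dupB"))
```

In this toy case the pair starts at offsets 2.5 and 3.5 on one arm and ends at -2.5 and 3.5, so the two paralogs sit at almost the same distance from the origin but on opposite sides.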

If our inversion model is correct, then the genes along both diagonals in the between-species alignments should be orthologous, which is the case (see above). In contrast, genes along the anti-diagonal in the within-species X-alignments should be recent tandem duplicates that have been separated by inversions. This also appears to be the case – in the within-species analysis of V. cholerae chrI ORFs, the X-alignment shows up best when only recent duplicates are analyzed (Figure 2d). The splitting of tandem duplicates by inversions may be a general mechanism to stabilize the coexistence of duplicated genes, as it will prevent their elimination by unequal crossing-over or replication slippage events.

What could cause inversions that pivot around the origin and terminus of the genome to occur more frequently than other inversions? One possibility is that many inversions occur, but there is selection against those that change the distance of a gene from the origin or terminus. Such a possibility has been suggested by experimental work in E. coli [14,15]. Additional studies have, however, suggested that there is little selective difference between inversions and that instead there may be certain regions that are more prone to inversion than others [16,17,18,22,23]. Alternatively, the inversion events could be linked to replication, as has been suggested for small local inversion events [24]. Whatever the mechanisms, the fact that we find evidence for such inversions between many pairs of species suggests that they are a common feature of bacterial evolution.

Many aspects of the X-alignments require further exploration. For example, to split a tandem duplication, an inversion must fall precisely on the boundary between two duplicated genes. This would appear to be unlikely, requiring a large number of inversions in order to generate a sufficient number of split gene pairs. If the mechanisms of gene duplication are somehow related to the mechanisms of inversion, however, then this model is more plausible. The process of duplicating a gene, if it occurs during replication, might promote a recombination event within the bacterial chromosome that inverts the sequence from the origin up to that point. As with inversion events, recombination and replication have been found to be tightly coupled [25].


Conclusions

We present here a novel observation regarding the conservation between bacterial species of the distance of particular genes from the replication origin or terminus. The initial observation was only possible due to the availability of complete genome sequences from pairs of moderately closely related species (for example, V. cholerae and E. coli). This shows the importance of having genome pairs from many levels of evolutionary relatedness. Comparisons of distantly related species enable the determination of universal features of life as well as of events that occur very rarely. Comparison of very closely related species allows the identification of frequent events such as transitional changes at third codon positions or tandem duplications. To elucidate all other events in the history of life, genome pairs covering all the intermediate levels of evolutionary relatedness will be needed.


Materials and methods






Genomes analyzed

Complete published genome sequences were obtained from the National Center for Biotechnology Information website [26] or from the TIGR Comprehensive Microbial Resource [27]. These included Aeropyrum pernix [28], Aquifex aeolicus [29], Archaeoglobus fulgidus [30], Bacillus subtilis [31], Borrelia burgdorferi [32], Campylobacter jejuni [33], Chlamydia pneumoniae AR39 [19], Chlamydia pneumoniae CWL029 [34], Chlamydia trachomatis (D/UW-3/Cx) [35], Chlamydia trachomatis MoPn [19], Deinococcus radiodurans [36], Escherichia coli [5], Haemophilus influenzae [37], Helicobacter pylori [38], Helicobacter pylori J99 [39], Methanobacterium thermoautotrophicum [40], Methanococcus jannaschii [41], Mycobacterium tuberculosis [8], Mycoplasma genitalium [42], Mycoplasma pneumoniae [43], Neisseria meningitidis MC58 [20], Neisseria meningitidis serogroup A strain Z2491 [44], Pyrococcus horikoshii [45], Rickettsia prowazekii [46], Synechocystis sp. [47], Thermotoga maritima [48], Treponema pallidum [49], and Vibrio cholerae [4]. In addition, a few unpublished genomes were analyzed: Streptococcus pyogenes (obtained from the Oklahoma University Genome Center website [7]), Streptococcus pneumoniae (H. Tettelin, personal communication), and Mycobacterium leprae (obtained from the Sanger Centre Pathogen Sequencing Group website [9]).

Whole-genome DNA alignments

DNA alignments of the complete genomic sequences of all bacteria used in this study were accomplished with the MUMmer program [6]. This program uses an efficient suffix tree construction algorithm to rapidly compute alignments of entire genomes. The algorithm identifies all exact matches of nucleotide subsequences that are contained in both input sequences; these exact matches must be longer than a specified minimum length, which was set to 20 base pairs for this comparison. To search for genome-scale alignments within species, complete bacterial and archaeal genomes (25 in total including all published genomes) were aligned with their own reverse complements. To search for between-species alignments, all genomes were aligned against all others in both orientations.
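To make the idea of exact-match seeds in both orientations concrete, here is a minimal sketch. It is not MUMmer and does not use a suffix tree: it swaps in a simple k-mer dictionary, reports only fixed-length k-mers that occur exactly once in each sequence, and does not extend them to maximal matches. The function names are hypothetical, and the default `k=20` simply mirrors the minimum length quoted above.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

def unique_exact_matches(a, b, k=20):
    """Return sorted (pos_in_a, pos_in_b) pairs for k-mers unique in both sequences."""
    ua = unique_kmer_positions(a, k)
    ub = unique_kmer_positions(b, k)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())

def forward_and_reverse_matches(a, b, k=20):
    """Matches of a against b and against the reverse complement of b.

    Positions in the second list are in reverse-complement coordinates of b.
    """
    return unique_exact_matches(a, b, k), unique_exact_matches(a, reverse_complement(b), k)

# Toy usage with short sequences and a small k.
fwd, rev = forward_and_reverse_matches("ACGTAC", "TTACGTCC", k=4)
print(fwd, rev)
```

A within-species search in this scheme would call `forward_and_reverse_matches(genome, genome)` and inspect the second list. The dictionary approach is far less memory-efficient than a suffix tree on whole genomes, so it is only suitable for illustration.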

Whole-genome protein comparisons

The predicted proteome of each complete genome sequence (all predicted proteins in the genome) was compared to the proteomes of all complete genome sequences (including itself) using the fasta3 program [50]. Matches with an expected score (e-value) of 10^-5 or less were considered significant.

Statistical significance of X-alignments

To calculate the statistical significance of the X-alignments, the maximal unique matching subsequences (MUMs) for unrelated genomes were examined and found to be uniformly distributed [6]. With a uniform background, the expected density of MUMs in any region of an alignment plot is a simple proportion of the area of that region to the entire plot. In particular, in an alignment with N total MUMs, the probability (Pr) of observing at least m matches in a region with area p can be computed using the binomial distribution in Equation 1:
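With the symbols defined above (N total MUMs, a region covering fraction p of the plot, and at least m matches in that region), the upper tail of the binomial distribution takes the standard form below. This is the textbook expression written to match those definitions; X, the number of MUMs falling in the region, is a symbol introduced here for convenience.

```latex
\Pr(X \ge m) \;=\; \sum_{k=m}^{N} \binom{N}{k}\, p^{k}\,(1-p)^{N-k}
```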

The alignment of V. cholerae chrI (both forward and reverse strands) versus E. coli contains 926 MUMs. The MUMs forming X-alignments appear along the diagonal (y = x) and the anti-diagonal (y = L -x, where L is the genome length). To estimate the significance of the alignments in both directions, diagonal strips were sampled along each of the diagonals. The width of each strip was set at 10% of the plot area and significance values were calculated (Table 1).
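The calculation can be sketched in a few lines of Python. Several things here are assumptions rather than details from the text: the strip is modelled as the band |y - x| <= d (or |y - (L - x)| <= d) clipped to an L by L plot, with d chosen so the band covers the requested fraction of the plot area; wrap-around on circular genomes is ignored; and the function names are hypothetical. The tail probability is summed in log space so that large N does not overflow.

```python
import math

def binomial_tail(n_total, m, p):
    """Probability of at least m successes in n_total trials, for 0 < p < 1."""
    if m <= 0:
        return 1.0
    if m > n_total:
        return 0.0
    log_terms = [
        math.lgamma(n_total + 1) - math.lgamma(k + 1) - math.lgamma(n_total - k + 1)
        + k * math.log(p) + (n_total - k) * math.log1p(-p)
        for k in range(m, n_total + 1)
    ]
    top = max(log_terms)
    return math.exp(top) * sum(math.exp(t - top) for t in log_terms)

def strip_half_width(length, area_fraction):
    """Half-width d such that the clipped diagonal band covers area_fraction of the plot."""
    # Band area is L^2 - (L - d)^2, so d / L = 1 - sqrt(1 - area_fraction).
    return length * (1.0 - math.sqrt(1.0 - area_fraction))

def count_in_strips(points, length, area_fraction=0.10):
    """Count match points near the diagonal (y = x) and the anti-diagonal (y = L - x)."""
    d = strip_half_width(length, area_fraction)
    on_diag = sum(1 for x, y in points if abs(y - x) <= d)
    on_anti = sum(1 for x, y in points if abs(y - (length - x)) <= d)
    return on_diag, on_anti

# Toy usage: five made-up match points on a plot of length 100.
points = [(10, 12), (50, 50), (20, 80), (30, 71), (40, 90)]
diag_count, anti_count = count_in_strips(points, 100)
p_diag = binomial_tail(len(points), diag_count, 0.10)
print(diag_count, anti_count, p_diag)
```

For the comparison described above one would pass the 926 MUM coordinates and call `binomial_tail(926, m, 0.10)` with the observed strip count m for each diagonal. The values actually obtained are those reported in Table 1.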

Identification of origins of replication

The origins of replication for the bacterial genomes have been characterized by a variety of methods. For E. coli, M. tuberculosis and M. leprae, the origins have been well-characterized by laboratory studies [51,52]. The origins and termini of C. trachomatis, C. pneumoniae and V. cholerae were identified by GC-skew [53] and by characteristic genes in the region of the origin [4,19]. GC-skew uses the function (G-C)/(G+C) computed on 2,000 bp windows across the genome, which exhibits a clear tendency in many bacterial genomes to be positive for the leading strand and negative for the lagging strand. The origin of H. pylori was determined by oligomer skew [54] and confirmed by GC-skew. The origins and termini of S. pneumoniae and S. pyogenes were determined by the authors of the present study using GC-skew analysis and the locations of characteristic genes, particularly the chromosome replication initiator gene dnaA.
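A minimal sketch of the GC-skew calculation follows. It uses the function (G-C)/(G+C) and the 2,000 bp window size given above; the choice of non-overlapping windows, the handling of windows with no G or C, and the helper `sign_changes` (which simply reports where the skew switches sign, as a crude pointer to candidate origin or terminus regions) are assumptions made for illustration.

```python
def gc_skew(seq, window=2000):
    """Return (window_start, skew) pairs, with skew = (G - C) / (G + C) per window."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C are given a skew of 0.0 (sketch assumption).
        values.append((start, (g - c) / (g + c) if g + c else 0.0))
    return values

def sign_changes(skews):
    """Window starts at which the skew switches sign relative to the previous window."""
    changes = []
    for (_, previous), (start, current) in zip(skews, skews[1:]):
        if previous * current < 0:
            changes.append(start)
    return changes

# Toy usage with a tiny window so the result is easy to follow.
toy = gc_skew("GGGGCCCCGGGC", window=4)
print(toy, sign_changes(toy))
```

On real genomes the per-window values are noisy, so a sign change in a single window is only a hint; the text above combines the skew signal with the positions of characteristic genes such as dnaA.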





Acknowledgements



We thank S. Eddy, M.A. Riley, T. Read, A. Stoltzfus, M-I. Benito and I. Paulsen for helpful comments, suggestions and discussions. S.L.S. was supported in part by NSF grant IIS-9902923 and NIH grant R01 LM06845. S.L.S. and J.A.E. were supported in part by NSF grant KDI-9980088. Data for all published complete genome sequences were obtained from the NCBI genomes database [26] or from The Institute for Genomic Research (TIGR) Microbial Genome Database [27]. The sequences of V. cholerae, S. pneumoniae, and M. tuberculosis (CDC 1551) were determined at TIGR with support from NIH and the NIAID. The M. leprae sequence data were produced by the Pathogen Sequencing Group at the Sanger Centre. Sequencing of M. leprae is funded by the Heiser Program for Research in Leprosy and Tuberculosis of The New York Community Trust and by L’Association Raoul Follereau. The M. tuberculosis CDC 1551 genome sequence was obtained from TIGR. The S. pyogenes genome sequence came from the Streptococcal Genome Sequencing Project, funded by USPHS/NIH grant AI38406; it was kindly made available by B. A. Roe, S.P. Linn, L. Song, X. Yuan, S. Clifton, R.E. McLaughlin, M. McShan and J. Ferretti, and can be obtained from the website of the Oklahoma University Genome Center [7].





References




1. Seoighe C, Wolfe KH: Updated map of duplicated regions in the yeast genome. Gene 1999, 238:253-261.
2. Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M, et al.: Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 1999, 402:761-768.
3. Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N, et al.: Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 1999, 402:769-777.
4. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, et al.: The genome sequence of Vibrio cholerae, the aetiologic agent of cholera. Nature 2000, 406:477-483.
5. Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al.: The complete genome sequence of Escherichia coli K-12. Science 1997, 277:1453-1462.
6. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Res 1999, 27:2369-2376.
7. Oklahoma University Genome Center
8. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE III, et al.: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998, 393:537-544.
9. Sanger Centre Pathogen Sequencing Group
10. Zipkas D, Riley M: Proposal concerning mechanism of evolution of the genome of Escherichia coli. Proc Natl Acad Sci USA 1975, 72:1354-1358.
11. Wagner A: The fate of duplicated genes: loss or new function? BioEssays 1998, 20:785-788.
12. Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154:459-473.
13. Nadeau JH, Sankoff D: Comparable rates of gene loss and functional divergence after genome duplications early in vertebrate evolution. Genetics 1997, 147:1259-1266.
14. Francois V, Louarn J, Patte J, Rebollo JE, Louarn JM: Constraints in chromosomal inversions in Escherichia coli are not explained by replication pausing at inverted terminator-like sequences. Mol Microbiol 1990, 4:537-542.
15. Rebollo JE, Francois V, Louarn JM: Detection and possible role of two large nondivisible zones on the Escherichia coli chromosome. Proc Natl Acad Sci USA 1988, 85:9391-9395.
16. Segall A, Mahan MJ, Roth JR: Rearrangement of the bacterial chromosome: forbidden inversions. Science 1988, 241:1314-1318.
17. Mahan MJ, Roth JR: Ability of a bacterial chromosome segment to invert is dictated by included material rather than flanking sequence. Genetics 1991, 129:1021-1032.
18. Segall AM, Roth JR: Recombination between homologies in direct and inverse orientation in the chromosome of Salmonella: intervals which are nonpermissive for inversion formation. Genetics 1989, 122:737-747.
19. Read TD, Brunham RC, Shen C, Gill SR, Heidelberg JF, White O, Hickey EK, Peterson J, Utterback T, Berry K, et al.: Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res 2000, 28:1397-1406.
20. Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JA, Ketchum KA, Hood DW, Peden JF, Dodson RJ, et al.: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 2000, 287:1809-1815.
21. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151:1531-1545.
22. Schmid MB, Roth JR: Selection and endpoint distribution of bacterial inversion mutations. Genetics 1983, 105:539-557.
23. Mahan MJ, Roth JR: Reciprocality of recombination events that rearrange the chromosome. Genetics 1988, 120:23-35.
24. Gordon AJ, Halliday JA: Inversions with deletions and duplications. Genetics 1995, 140:411-414.
25. Valencia-Morales E, Romero D: Recombination enhancement by replication (RER) in Rhizobium etli. Genetics 2000, 154:971-983.
26. National Center for Biotechnology Information, Entrez Genomes
27. The Institute for Genomic Research Microbial Genome Database
28. Kawarabayasi Y, Hino Y, Horikawa H, Yamazaki S, Haikawa Y, Jin-no K, Takahashi M, Sekine M, Baba S, Ankai A, et al.: Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res 1999, 6:83-101.
29. Deckert G, Warren PV, Gaasterland T, Young WG, Lenox AL, Graham DE, Overbeek R, Snead MA, Keller M, Aujay M, et al.: The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 1998, 392:353-358.
30. Klenk H-P, Clayton RA, Tomb J-F, White O, Nelson KE, Ketchum KA, Dodson RJ, Gwinn M, Hickey EK, Peterson JD, et al.: The complete genomic sequence of the hyperthermophilic, sulfate-reducing archaeon Archaeoglobus fulgidus. Nature 1997, 390:364-370.
31. Kunst F, Ogasawara N, Moszer I, Albertini A, Alloni G, Azevedo V, Bertero M, Bessieres P, Bolotin A, Borchert S, et al.: The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 1997, 390:249-256.
32. Fraser CM, Norris SJ, Weinstock GM, White O, Sutton GG, Dodson R, Gwinn M, Hickey EK, Clayton R, Ketchum KA, et al.: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 1997, 390:580-586.
33. Parkhill J, Wren BW, Mungall K, Ketley JM, Churcher C, Basham D, Chillingworth T, Davies RM, Feltwell T, Holroyd S, et al.: The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 2000, 403:665-668.
34. Kalman S, Mitchell W, Marathe R, Lammel C, Fan J, Hyman RW, Olinger L, Grimwood J, Davis RW, Stephens RS: Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat Genet 1999, 21:385-389.
35. Stephens RS, Kalman S, Lammel C, Fan J, Marathe R, Aravind L, Mitchell W, Olinger L, Tatusov RL, Zhao Q, et al.: Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 1998, 282:754-759.
36. White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, et al.: Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 1999, 286:1571-1577.
37. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269:496-512.
38. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty BA, et al.: The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 1997, 388:539-547.
39. Alm RA, Ling LS, Moir DT, King BL, Brown ED, Doig PC, Smith DR, Noonan B, Guild BC, deJonge BL, et al.: Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 1999, 397:176-180.
40. Smith DR, Doucette-Stamm LA, Deloughery C, Lee H, Dubois J, Aldredge T, Bashirzadeh R, Blakely D, Cook R, Gilbert K, et al.: Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics. J Bacteriol 1997, 179:7135-7155.
41. Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, Fitzgerald LM, Clayton RA, Gocayne JD, et al.: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273:1058-1073.
42. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270:397-403.
43. Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li BC, Herrmann R: Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res 1996, 24:4420-4449.
44. Parkhill J, Achtman M, James KD, Bentley SD, Churcher C, Klee SR, Morelli G, Basham D, Brown D, Chillingworth T, et al.: Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature 2000, 404:502-506.
45. Kawarabayasi Y, Sawada M, Horikawa H, Haikawa Y, Hino Y, Yamamoto S, Sekine M, Baba S, Kosugi H, Hosoyama A, et al.: Complete sequence and gene organization of the genome of a hyperthermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res 1998, 5:55-76.
46. Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG: The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 1998, 396:133-140.
47. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.: Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 1996, 3:109-136.
48. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, et al.: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399:323-329.
49. Fraser CM, Norris SJ, Weinstock GM, White O, Sutton GG, Dodson R, Gwinn M, Hickey EK, Clayton R, Ketchum KA, et al.: Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 1998, 281:375-388.
50. Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132:185-219.
51. Marsh RC, Worcel A: A DNA fragment containing the origin of replication of the Escherichia coli chromosome. Proc Natl Acad Sci USA 1977, 74:2720-2724.
52. Salazar L, Fsihi H, de Rossi E, Riccardi G, Rios C, Cole ST, Takiff HE: Organization of the origins of replication of the chromosomes of Mycobacterium smegmatis, Mycobacterium leprae and Mycobacterium tuberculosis and isolation of a functional origin from M. smegmatis. Mol Microbiol 1996, 20:283-293.
53. Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol 1996, 13:660-665.
54. Salzberg SL, Salzberg AJ, Kerlavage AR, Tomb JF: Skewed oligomers and origins of replication. Gene 1998, 217:57-67.

Tackling the hairy beast – Tetrahymena genome

Just thought I would put out a little self-promotional posting here on a paper we have published today on the genome of a very interesting organism called Tetrahymena thermophila. This organism is a single-celled eukaryote that lives in fresh water ponds.

This species has served as a powerful model organism for studies of the workings of eukaryotic cells. Studies of this species have led to some fundamental discoveries about how life works. For example, telomerase, the enzyme that helps keep the ends of linear chromosomes from degrading, was discovered in this species. This may not seem too important, but many folks think that degradation of chromosome ends in humans is involved in aging. Perhaps even more importantly (to me at least), studies of this species were fundamental to the discovery that RNA can be an enzyme. This discovery of catalytic RNA revolutionized our understanding of how cells work and how life evolved. Tom Cech and Sidney Altman were given the Nobel Prize in 1989 for this discovery.

Many (including myself) believe that having the genome sequence of this species will further spur research and its use as a model organism. In addition, we believe that some of the findings we report in our paper will further cement the importance of this species. For example, this species, though single-celled, encodes nearly as many proteins as humans and possesses many processes and pathways shared with animals but missing from other model single-celled species.

The project that led to this publication was undertaken while I was at TIGR (The Institute for Genomic Research) and involved a collaboration among people at dozens of research institutions around the world. It all started in 2001 when Ed Orias and his colleagues sought to see if anyone at TIGR would be interested in putting in a grant to sequence this species’ genome. I responded to the email saying I was interested, especially since I had interacted with multiple people who used this species as a model system (e.g., Laura Landweber at Princeton and Laura Katz at Smith). So I went to a FASEB meeting where the Tetrahymena Genome Steering Committee was meeting and discussed with them how TIGR might help sequence the genome. And after talking to other genome centers, they selected TIGR to put in a grant proposal with them.

We ended up getting funding from two grant proposals – one from NIGMS and the other from the NSF Microbial Genome Sequencing Program. The sequencing was done in a rapid burst at the new Joint Technology Center which TIGR shares with the Venter Institute. And then we spent ~1.5 years analyzing the sequence data (and assemblies) that came out and in the end we fortunately were able to get our paper into PLoS Biology, in my opinion the best place available to publish biology research.

Importantly, PLoS Biology is Open Access, which allows anyone anywhere to read about our work. This goes well with the free and open release we made of the genome sequence data. In fact, many people published papers on the genome before we did (sometimes scooping us). In the end, I accepted the risks of releasing the genome data with no restrictions in exchange for advancing research on this organism. I think this risk was well worth it, as we still got our big paper published and the field has advanced more rapidly than if we had not released the data.

Other links that may be of interest to people:

Eisen, J., Coyne, R., Wu, M., Wu, D., Thiagarajan, M., Wortman, J., Badger, J., Ren, Q., Amedeo, P., Jones, K., Tallon, L., Delcher, A., Salzberg, S., Silva, J., Haas, B., Majoros, W., Farzad, M., Carlton, J., Smith, R., Garg, J., Pearlman, R., Karrer, K., Sun, L., Manning, G., Elde, N., Turkewitz, A., Asai, D., Wilkes, D., Wang, Y., Cai, H., Collins, K., Stewart, B., Lee, S., Wilamowska, K., Weinberg, Z., Ruzzo, W., Wloga, D., Gaertig, J., Frankel, J., Tsao, C., Gorovsky, M., Keeling, P., Waller, R., Patron, N., Cherry, J., Stover, N., Krieger, C., del Toro, C., Ryder, H., Williamson, S., Barbeau, R., Hamilton, E., & Orias, E. (2006). Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote PLoS Biology, 4 (9) DOI: 10.1371/journal.pbio.0040286