Linnaeus at 300

So – normally I would not post anything here regarding stuff in Nature, due to Nature’s incomplete Open Access policies, but I am doing so here because the relevant material is 50-50 open/closed. The material I refer to is a pretty interesting set of articles in the latest issue celebrating the 300th birthday of some guy named Linnaeus.

Among the articles, some are free access and some are not. Below is the list of commentaries/news/reviews and info on their accessibility. I am particularly fond of the one by John Whitfield (and not just because I am in it). Unfortunately it is not “Free Access.” But it does have some interesting tidbits on “phylogenomics”, so if you have access – check it out. If not, well, post something to his blog complaining about that.

Editorial

The legacy of Linnaeus Free access

Nature 446, 231 (15 March 2007) doi:10.1038/446231b


Features

News Features

Linnaeus at 300: We are family

John Whitfield

Nature 446, 247 (15 March 2007) doi:10.1038/446247a


Linnaeus at 300: The species and the specious Free access

Emma Marris

Nature 446, 250 (15 March 2007) doi:10.1038/446250a


Linnaeus at 300: The big name hunters

Brendan Borrell

Nature 446, 253 (15 March 2007) doi:10.1038/446253a


Linnaeus at 300: The royal raccoon from Swedesboro Free access

Henry Nicholls

Nature 446, 255 (15 March 2007) doi:10.1038/446255a


Commentaries

Linnaeus in the information age

H. C. J. Godfray

Nature 446, 259 (15 March 2007) doi:10.1038/446259a


Spreading the word

Sandra Knapp, Andrew Polaszek and Mark Watson

Nature 446, 261 (15 March 2007) doi:10.1038/446261a

An inoculated mind

Here’s a somewhat self-serving recommendation for people to check out the blog of Karl Mogel. He is a Davisite (who used to write for the U. C. Davis student paper) who has a radio show on science, and he has lots of interesting stuff on his blog about evolution. So check it out. Oh, and he put his interview of me online … haven’t listened yet but it can be found here.

Genomics Dub Collective

Just thought I would post this email I got regarding the Genomics Dub Collective. I received this a while ago and forgot to make the post live … so here it is. This stuff is, well, you have to check it out if you are interested in evolution.

The Genomic Dub Collective pleased to announce that all the videos and MP3s associated with the Origin of Species of Dub have now been released on to the web in time for Darwin Day (Feb 12th) 2007.

You can access the videos here: http://www.infection.bham.ac.uk/BPAG/Dub/Videos/
and the MP3s here: http://www.infection.bham.ac.uk/BPAG/Dub/free_OriginMP3s/

We hope you find them stimulating and thought-provoking, as well as entertaining. We have provided copious notes for anyone interested in following up the background to the issues raised. Please take time to rate the videos in YouTube and pass this information on to anyone who might be interested.

And if you don’t like them, apologies for the intrusion.

Cheers

Mark

Professor Mark Pallen
Professor of Microbial Genomics
Centre for Systems Biology,
School of Biosciences,
University of Birmingham, BIRMINGHAM, B15 2TT

“Scientific work must not be considered from the point of view of the direct usefulness of it. It must be done for itself, for the beauty of science, and then there is always the chance that a scientific discovery may become a benefit”
Marie Curie

Glassy-winged sharpshooter symbionts

As in earlier posts, I am posting one of my Open Access publications here … this one is on the genomics of symbionts of the glassy-winged sharpshooter. The citation is

Citation: Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, et al. (2006) Metabolic Complementarity and Genomics of the Dual Bacterial Symbiosis of Sharpshooters. PLoS Biol 4(6): e188 doi:10.1371/journal.pbio.0040188

Metabolic Complementarity and Genomics of the Dual Bacterial Symbiosis of Sharpshooters

Dongying Wu1, Sean C. Daugherty1, Susan E. Van Aken2, Grace H. Pai2, Kisha L. Watkins1, Hoda Khouri1, Luke J. Tallon1, Jennifer M. Zaborsky1, Helen E. Dunbar3, Phat L. Tran3, Nancy A. Moran3, Jonathan A. Eisen1*¤

1 The Institute for Genomic Research, Rockville, Maryland, United States of America, 2 J. Craig Venter Institute, Joint Technology Center, Rockville, Maryland, United States of America, 3 Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona, United States of America

Mutualistic intracellular symbiosis between bacteria and insects is a widespread phenomenon that has contributed to the global success of insects. The symbionts, by provisioning nutrients lacking from diets, allow various insects to occupy or dominate ecological niches that might otherwise be unavailable. One such insect is the glassy-winged sharpshooter (Homalodisca coagulata), which feeds on xylem fluid, a diet exceptionally poor in organic nutrients. Phylogenetic studies based on rRNA have shown two types of bacterial symbionts to be coevolving with sharpshooters: the gamma-proteobacterium Baumannia cicadellinicola and the Bacteroidetes species Sulcia muelleri. We report here the sequencing and analysis of the 686,192–base pair genome of B. cicadellinicola and approximately 150 kilobase pairs of the small genome of S. muelleri, both isolated from H. coagulata. Our study, which to our knowledge is the first genomic analysis of an obligate symbiosis involving multiple partners, suggests striking complementarity in the biosynthetic capabilities of the two symbionts: B. cicadellinicola devotes a substantial portion of its genome to the biosynthesis of vitamins and cofactors required by animals and lacks most amino acid biosynthetic pathways, whereas S. muelleri apparently produces most or all of the essential amino acids needed by its host. This finding, along with other results of our genome analysis, suggests the existence of metabolic codependency among the two unrelated endosymbionts and their insect host. This dual symbiosis provides a model case for studying correlated genome evolution and genome reduction involving multiple organisms in an intimate, obligate mutualistic relationship. In addition, our analysis provides insight for the first time into the differences in symbionts between insects (e.g., aphids) that feed on phloem versus those like H. coagulata that feed on xylem. 
Finally, the genomes of these two symbionts provide potential targets for controlling plant pathogens such as Xylella fastidiosa, a major agroeconomic problem, for which H. coagulata and other sharpshooters serve as vectors of transmission.

Funding. Funding was from National Science Foundation Biocomplexity grants 9978518 and 0313737.

Academic Editor: Julian Parkhill, The Sanger Institute, United Kingdom

Received: October 21, 2005; Accepted: April 10, 2006; Published: June 6, 2006

Copyright: © 2006 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: CDS, protein-coding gene; LPS, lipopolysaccharide; pI, isoelectric point; SNP, single nucleotide polymorphism

* To whom correspondence should be addressed. E-mail: jaeisen@ucdavis.edu

¤ Current address: UC Davis Genome Center, Department of Medical Microbiology and Immunology and Section of Evolution and Ecology, University of California Davis, Davis, California, United States of America

Introduction

Through mutualistic symbioses with bacteria, eukaryotes have been able to acquire metabolic capabilities that in turn have allowed the utilization of otherwise unavailable ecological niches. Among the diverse examples of such symbioses, those involving bacteria that live inside the cells of their host are of great interest. These “endo”-symbioses played a central role in the early evolution of eukaryotes (e.g., the establishment of the mitochondria and chloroplasts) and in many more recent diversification events such as animals living at deep-sea vents, corals, blood-feeding flies, carpenter ants, and several clades of sap-feeding insects.

Insects that feed primarily or entirely on sap are a virtual breeding ground for symbioses because this liquid rarely contains sufficient quantities of the nutrients that animals are unable to make for themselves. For example, the sole diet of most aphids is sap from phloem, which is the component of the plant vascular system normally used to transport sugars and other organic nutrients. Despite the presence of many nutrients, phloem usually has little, if any, of the “essential” amino acids that cannot be synthesized by animals. To compensate, aphids engage in an obligate symbiosis with bacteria in the genus Buchnera, which, in exchange for sugar and simple, nonessential amino acids, synthesize the needed essential amino acids for their hosts.

The exact details of aphid-Buchnera interactions have been difficult to determine because no Buchnera has been cultivated outside its host. This limitation has been circumvented to a large degree by sequencing and analysis of multiple Buchnera genomes [1–3], which have provided detailed insights into the biology, evolution, and ecology of these symbioses. For example, despite having undergone massive amounts of gene loss in the time after they diverged from free-living Gammaproteobacteria, the Buchnera encode many pathways for the synthesis of essential amino acids. A critical component of these genomic studies is that, in most aphids, Buchnera is the only symbiont [4]. This implies that when genome-based metabolic pathway reconstructions suggest that a particular Buchnera is unable to make all the essential nutrients for its host, either the reconstructions are wrong, or the host must be getting those nutrients from its diet. For example, although one of the Buchnera strains is predicted to not be able to incorporate inorganic sulfur for the production of cysteine and other compounds, sulfur-containing organic compounds are known to occur in the diet of its host aphid [2].

In many other sap-feeding insects, including some aphids, several heritable bacterial types are found, often living in close proximity within specialized structures in the insect body (e.g., [5–9]). This is apparently the case for all insects that are strict xylem-sap feeders, which include cicadas, spittlebugs, and some leafhoppers [5]. Xylem is the component of the plant vascular system that is primarily used to transport water and salts from the roots to the rest of the plant. Xylem sap has the lowest nitrogen or carbon content of any plant component and contains few organic compounds [10]. Although the composition varies among plant species and developmental stages, xylem fluid is always nutrient-poor, containing mostly inorganic compounds and minerals with small amounts of amino acids and organic acids [11–15]. As in phloem, the amino acids consist mainly of nonessential types such as glutamine, asparagine, and aspartic acid, with all essential ones absent or present in very low amounts.

Among xylem-feeders, sharpshooters (Insecta: Hemiptera: Cicadellidae: Cicadellinae) are a prominent group of about 2,000 species [10], many of which are major pests of agriculture due to their roles as vectors of plant pathogens. Sharpshooters are known to possess two bacterial symbionts. One, called Candidatus Baumannia cicadellinicola (hereafter Baumannia), resembles Buchnera in having small genome size and a biased nucleotide composition favoring adenine and thymine (A + T) and in belonging to the Enterobacteriales group in the Gammaproteobacteria [16]. The other symbiont, which was recently named Candidatus Sulcia muelleri (hereafter Sulcia), is in the Bacteroidetes phylum (formerly called the Cytophaga-Flexibacter-Bacteroides, or CFB, phylum) and is distributed widely in related insect hosts [9]. Both symbionts are vertically transmitted in eggs and are housed in a specialized bacteriome within developing nymphs and adults, and molecular phylogenetic studies show that both symbionts represent ancient associations dating to the origin of sharpshooters (Baumannia) or earlier (Sulcia) [5,9,16].

We sought to apply genome sequencing and analysis methods to the sharpshooter symbioses. For a host species, we selected the glassy-winged sharpshooter, Homalodisca coagulata. This pest species has a rapidly expanding geographic range and inflicts major crop damage as a vector for the bacterium Xylella fastidiosa, the agent of Pierce’s disease of grapes and other plant diseases [10]. Initially, we focused on the Baumannia symbiont with the idea that comparisons with the related Buchnera species would allow us to better identify differences that related to xylem versus phloem feeding. After completing the genome of this Baumannia, analysis revealed that many pathways that we expected to be present were missing. In contrast to the Buchnera-aphid symbioses, a second symbiont is present in sharpshooters, so we could not assume that the nutrients that would have been made by the missing pathways must be in the sharpshooter diet. Despite technical difficulties, we were able to obtain a significant portion of the genome of the Sulcia from the same wild-caught samples of H. coagulata.

Here we present the analysis of these two genomic datasets and the striking finding that the symbionts appear to work in concert, and possibly even share metabolites, to produce all of the nutrients needed by the host to survive on its xylem diet.

Results/Discussion

General Features of the Baumannia Genome and Predicted Genes

The genome of Baumannia consists of one circular chromosome of 686,192 base pairs (bp) with an average G + C content of 33.23% (Table 1). The genome size closely matches an earlier estimate from gel electrophoresis [16]. Baumannia has neither a strong GC skew pattern nor a dnaA homolog—two features commonly used to identify origins of replication in bacteria. A putative origin was identified and designated as position 1, based on a weak but clear transition in oligonucleotide skew.
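The two summary statistics mentioned above, G + C content and GC skew, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 1,000-bp window are hypothetical choices, and the minimum of cumulative GC skew is a commonly used heuristic for locating a replication origin, not the oligonucleotide-skew method that was actually used for Baumannia.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C) for one window; 0.0 if the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_gc_skew(seq: str, window: int = 1000) -> list[float]:
    """Running sum of per-window GC skew along the sequence."""
    totals, running = [], 0.0
    for start in range(0, len(seq), window):
        running += gc_skew(seq[start:start + window])
        totals.append(running)
    return totals


def putative_origin_window(seq: str, window: int = 1000) -> int:
    """Index of the window where cumulative GC skew is lowest (heuristic)."""
    skews = cumulative_gc_skew(seq, window)
    if not skews:
        raise ValueError("sequence is empty")
    return min(range(len(skews)), key=skews.__getitem__)
```

In a genome with a weak skew signal, as reported here for Baumannia, the minimum of this curve would be shallow and should be treated as a hint rather than a call.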

Table 1. General Features of the Genomes of Baumannia and Other Insect Endosymbionts

A total of 46 noncoding RNA genes were identified: six rRNAs (two sets of 16S, 5S, and 23S), one small RNA, and 39 tRNAs including at least one for each of the 20 amino acids. A total of 605 putative protein-coding genes (CDSs) were identified in the genome, and 89.9% of these can be assigned a putative biological function. An overview of the Baumannia genome and its encoded genes is illustrated in Figure 1, and features of these genes are summarized in Table S1. Only four of the CDSs lack detectable homologs in GenBank or other complete genomes and thus can be considered “orphan” genes.

Figure 1. Circular View of the Baumannia Genome

Circles correspond to the following features, starting with the outermost circle: (1) forward strand genes, (2) reverse strand genes, (3) χ² deviation of local nucleotide composition from the genome average, (4) GC skew, (5) tRNAs (green lines), (6) rRNAs (blue lines), and (7) small RNAs (red lines). Color legend for CDSs and number of genes in each category are at the bottom.

Evolution of Baumannia and the Genomes of Intracellular Organisms

Genome sequences have been found to be very useful in providing for better resolution and accuracy in phylogenetic trees than is achieved using single genes such as rRNA genes [17]. Although there are many ways to build genome-based trees, one particularly powerful approach is to identify orthologous genes between species and to combine alignments of these genes into a single alignment. We built a tree for Baumannia and related species from 45 ribosomal proteins using this concatenation approach (Figure 2A). This tree supports the rRNA-based grouping of Baumannia with the insect endosymbionts of the genera Buchnera, Wigglesworthia (symbionts of tsetse flies), and Blochmannia (symbionts of ants) [16]. However, the branching order is different in the protein tree with Baumannia being the deepest branching symbiont. As in prior genomic studies [18], the insect endosymbionts in the tree in Figure 2A are monophyletic (i.e., they share a common ancestor to the exclusion of all other species for which genomes are available). A possible close relationship of Baumannia to the other symbionts in the group is further supported by the presence of a substantial number of segments of conserved gene order (Figure 2B).
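The concatenation step described above can be illustrated with a minimal Python sketch. It assumes each per-gene alignment is already available as a mapping from taxon name to aligned sequence; the function name, the data layout, and the choice to pad missing taxa with gaps are illustrative assumptions, not a description of the pipeline used for Figure 2A.

```python
def concatenate_alignments(
    alignments: list[dict[str, str]], gap: str = "-"
) -> dict[str, str]:
    """Join per-gene alignments end to end into one alignment per taxon.

    Each input maps taxon name -> aligned sequence. A taxon missing from
    a gene is padded with gap characters so every row keeps equal length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: dict[str, list[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each alignment needs rows of one equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

The resulting supermatrix is what would then be handed to a tree-building program; the sketch deliberately stops before that step.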

Figure 2. Genome-Based Phylogenetic Analysis of Baumannia

(A) Maximum-likelihood tree of gamma-proteobacterial endosymbionts. The tree was built from concatenated alignments of 45 ribosomal proteins using the PHYML program. The bootstrap value is based upon 1,000 replications.

(B) Gene order comparison of Baumannia and Blochmannia floridanus. The plot shows the locations of homologous proteins between the two genomes.

All of these insect endosymbionts, including Baumannia, exhibit many genome-level trends commonly found in intracellular organisms when compared to free-living relatives, including a smaller genome, lower G + C content, a higher average predicted isoelectric point for encoded proteins, and more rapidly evolving proteins (Table 1, Figure 2A). Of critical importance to understanding these trends is that they occur in all types of intracellular organisms (e.g., mutualists and pathogens) from across the tree of life (archaea, bacteria, and eukaryotes). Much research has focused on trying to understand the mechanisms underlying these trends for which there are two major hypotheses: the loss of DNA repair genes resulting in subsequent changes in mutation patterns or changes in population genetic parameters that lead to more genetic drift [19,20].

As a global explanation, the population genetic forces have more support (e.g., [21–23]), but the issue is far from resolved. One reason for this lack of resolution is that it is usually difficult to reconstruct the early events in the evolution of intracellularity. This insect endosymbiont group has many advantages that have made it a model system for resolving these early events. The addition of the Baumannia genome further improves the utility of this group for reasons we detail below.

One limitation of studies of the evolution of intracellular organisms is that the evolutionary separation between free-living and intracellular species is usually very large. For example, although much can be learned about recent mitochondrial evolution by comparative analysis of mitochondrial genomes, it is not even known what subgroup of Alphaproteobacteria contains the closest free-living relative of these organelles. This is because the mitochondrial symbiosis originated billions of years ago. The insect endosymbionts lack this limitation both because their symbioses evolved relatively recently and because of the large diversity of genomes available for the Gammaproteobacteria. To make the most use of these benefits, it is imperative to have an accurate picture of the phylogeny of the symbionts. The addition of the Baumannia genome is useful in this regard because its proteins appear to be evolving more slowly (as indicated by shorter branch lengths in Figure 2A) than those in the other endosymbionts. Having one organism with relatively short branch lengths in this group makes it more likely that the monophyly of the insect endosymbionts in trees is a reflection of their true history and not an artifact of phylogenetic reconstruction known as long-branch attraction.

The branch-length finding is an example of how Baumannia can be considered as a “missing link” in that it is an intermediate in many ways between the other insect endosymbionts and free-living species. This is the case not only for branch length but also for phylogenetic position (it is the deepest branching species), isoelectric point (pI), and G + C content (Table 1). By filling in the gaps between the free-living and intracellular species, the Baumannia genome should allow better inferences of the early events in the evolution of intracellularity.

Baumannia is not intermediate in value between free-living species and other insect endosymbionts for all “intracellular” features. For example, its genome size is smaller than that of some of the other endosymbionts. This is an important finding since the absolute values for many other features are highly correlated, both in this group and in other symbiont groups [24]. An example of this is shown for pI and G + C content (Figure 3). Another way of looking at this is that the Baumannia genome has shrunk more than one might expect based on its other intracellular features. This decoupling of the rates of change of different features can be useful in understanding the patterns of evolution in intracellular species. For example, one explanation for the pattern in Baumannia is that although it has experienced more gene loss than some of the other insect endosymbionts, it has maintained the most complete set of DNA repair genes for the group (Table 2). This retention of repair functions may have slowed its rate of change in other features, such as sequence change. If true, this suggests that, although the general differences between intracellular and free-living species may be due to population genetic forces, the variation among intracellular species may be due in part to variation in DNA repair. Consistent with this is the finding that species with the longest branch lengths in the trees (Wigglesworthia and Blochmannia, Figure 2A) are those that are missing the mismatch repair genes (Table 2).

Figure 3. Correlation between Genomic G + C Content and the Average pI of the Proteins of Endosymbiotic and Free-Living Gammaproteobacteria

Species shown are Buchnera aphidicola APS (BaAPS), Buchnera aphidicola BP (BaBp), Buchnera aphidicola SG (BaSg), Baumannia (Bc), Blochmannia floridanus (Bf), Blochmannia pennsylvanicus (Bp), E. coli K12 (Ec), Wigglesworthia glossinidia (Wg), and Yersinia pestis KIM (Yp).

Table 2. Homologs of Genes Known to Be Involved in DNA Repair and Recombination in the Complete Genomes of Insect Endosymbionts

The differential loss of repair genes among organisms that share many other genome properties allows the insect endosymbiont group to serve as a model for studying the long-term effects of loss of various repair processes. For example, the consequences for genome evolution of losing recA can be examined by comparing Baumannia and Wigglesworthia, which retain it, to Buchnera, which lack it. The same logic can be used to study the effects of the loss of the DNA replication initiation gene dnaA which is missing from Baumannia (see above), Wigglesworthia, and Blochmannia [18,25] but is present in the other insect endosymbionts. Although the species without recA may be able to survive with little or no recombination, those lacking dnaA must make use of alternative initiation pathways. Some alternatives such as pathways based on priA and recA [26] can be ruled out since at least one of these is missing from each of the species missing dnaA. The recBCD genes may play some role in initiation. This would explain why the recBCD genes are present in all insect endosymbionts (Table 2) including those missing recA, which is required for the “normal” role of recBCD in recombination.

Single Nucleotide Polymorphisms Are Abundant in the Baumannia Population

Genetic variation among individuals is both a complication of genome sequencing projects of uncultured species and a valuable source of information about microbial populations. For the Baumannia data, we used a stringent search protocol that may have missed some true polymorphisms but should have eliminated variation that was due to sequencing errors or cloning artifacts (see Materials and Methods). In total, we identified 104 single nucleotide polymorphisms (SNPs) and two insertion-deletion differences (indels) that fit these criteria. Details of the locations and types of polymorphisms are given in Table 3.

Table 3. Categorizations of Polymorphisms Detected in the Assembled Baumannia Genome

Since our DNA was isolated from the symbionts of hundreds of hosts, one major question is whether the observed polymorphisms were between symbionts within one host or between hosts. We used polymerase chain reaction surveys of individual insects to address this question. Of the 40 insects for which sequences were obtained individually for two loci, 35 showed identity to the consensus sequence for the Baumannia genome and five possessed the alternative alleles that were present as minority bases at four sites (two per fragment). No polymorphism was detected within individual hosts. Thus, the polymorphisms that we identified are real, and they reflect differences between symbionts of different hosts.

Since the Baumannia can be treated as maternally inherited markers, the finding of significant levels of polymorphism between hosts suggests that the sampled population contains individuals from two separate origins. This is somewhat in conflict with theories suggesting a single introduction of a small number of individuals into California [10] but is consistent with results from recent mitochondrial analysis [27].

Sequence polymorphisms have been detected in genomic studies of other insect endosymbionts [3,28]. The most relevant one for comparison to Baumannia is a study of the ant endosymbiont Blochmannia pennsylvanicus, although we note that the criteria they used for detecting a polymorphism were somewhat less stringent than ours [28]. The percentage of the SNPs that are in coding regions is different in the two species (81% in Baumannia and 65% in B. pennsylvanicus), but this is in line with differences in gene-coding density (88% in Baumannia and 76% in B. pennsylvanicus). For both species, the percentage of SNPs in protein-coding genes that represent synonymous differences is higher than expected from random changes given the genomic base compositions (52% in Baumannia and 62% in B. pennsylvanicus). This indicates ongoing purifying selection in both genomes. The most significant difference between the species is the higher ratio of transitions to transversions in B. pennsylvanicus (2.9 versus 1.4 in Baumannia; Table 3). We propose that this is due to the absence of mismatch repair genes in B. pennsylvanicus (as discussed above), which in other species leads to an increase in transition mutations [29]. An absence of mismatch repair would also explain the higher incidence of indels in B. pennsylvanicus.
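The transition-to-transversion ratio compared above is straightforward to compute once each SNP is recorded as a pair of alleles. The following Python sketch assumes SNPs are given as (reference, alternative) base pairs; the function names and input format are hypothetical and are not taken from the paper's Materials and Methods.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {ref!r} -> {alt!r}")
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps: list[tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    transitions = sum(is_transition(ref, alt) for ref, alt in snps)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions, ratio is undefined")
    return transitions / transversions
```

Because there are twice as many possible transversions as transitions, a ratio well above 0.5 already indicates a bias toward transitions, which is why the jump from 1.4 to 2.9 between the two species is notable.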

Metabolic Reconstructions Provide Insight into the Biology of Baumannia

Predictions of the metabolism of an organism from its genome sequence are critical to studies of uncultured organisms because of the difficulty of experimental studies. We have generated such a prediction for Baumannia (Figure 4). Although all such predictions should be viewed as hypotheses, not facts, they are greatly improved by having closely related species for which experimental studies are available. This is yet another advantage of working on the insect symbionts in the Gammaproteobacteria. For example, almost all Baumannia genes have clearcut orthologs in well-studied organisms such as Escherichia coli.

Figure 4. Predicted Metabolic Pathways in Baumannia and the Predicted Amino Acid Biosynthesis Pathways Encoded by the Partial Genome Sequence of Sulcia

Genes that are present are in red and the corresponding catalytic pathways are illustrated in solid black lines; the genes that are absent in the Baumannia genome and genes that have not been identified in the partial Sulcia genome are in gray, and the corresponding metabolic steps are illustrated in gray lines.

As expected, based on its small genome size, Baumannia has a relatively limited repertoire of synthetic capabilities. There are some important features of its predicted metabolism, and we discuss these in this and the next few sections of this paper, calling attention in particular to those of relevance to the host-symbiont interaction.

Baumannia is predicted to synthesize its own cell wall and plasma membrane, processes known to be lost in some intracellular species. It is, however, apparently unable to synthesize the lipopolysaccharide (LPS) commonly found in the outer membrane of other Gram-negative bacteria. The same is true for Buchnera species but not for Wigglesworthia and Blochmannia. The functional significance of this difference is unclear. On one hand, lipid A (the lipid component of the LPS) is generally highly toxic to animal cells; thus, LPS may be disadvantageous for endosymbionts and discarded during their evolution. Alternatively, the difference may reflect differences in the packaging of symbionts within the host bacteriocytes. Buchnera and Baumannia cells are surrounded by host-derived vesicles, while Wigglesworthia and Blochmannia directly contact the host cytoplasm.

The findings in regard to sugar metabolism are consistent with Baumannia acquiring sugars from its host and using them for energy metabolism. For import, a complete mannose phosphotransferase permease system is present including an Enzyme IIMan complex, the phosphotransferase system Enzyme I, and histidyl phosphorylatable protein PtsH. Imported sugars could then be fed into glycolysis. However, since the tricarboxylic acid cycle appears to be incomplete, apparently reducing power must come from other sources such as glycolysis itself, a pyruvate dehydrogenase complex, and an mqo type malate dehydrogenase. An intact electron transport chain consisting of NADH dehydrogenase I, cytochrome o oxidase, and ATP synthase is present.

The most striking aspects of the metabolism of Baumannia relate to what it apparently does and does not do in terms of the synthesis of essential nutrients missing from the hosts’ xylem diet.

Baumannia Is a Vitamin and Cofactor Machine

A large fraction of the Baumannia genome (83 genes, 13.7% of the total) encodes proteins predicted to have roles in pathways for the synthesis of a diverse set of vitamins, cofactors, prosthetic groups and related compounds (Figure 4, Table S1). These include thiamine (vitamin B1), riboflavin (vitamin B2), niacin (vitamin B3), pantothenic acid (vitamin B5), pyridoxine (vitamin B6), as well as biotin and folic acid. More detail on the pathways and the basis for the predictions is given below.

For the synthesis of riboflavin, folate, pyridoxal 5′-phosphate, and thiamine, complete pathways for de novo synthesis could be identified with Baumannia’s ability to produce endogenously important precursors such as ribulose-5-phosphate, phosphoenolpyruvate, pyruvate, dihydroxyacetonephosphate, glyceraldehyde-3-phosphate, erythrose-4-phosphate, guanosine triphosphate, 5-aminoimidazole ribonucleotide, 5′-phosphoribosylglycinamide, and 5,10-methylene-tetrahydrofolate.

For some compounds, although homologs of enzymes carrying out key steps in other species could not be identified, candidates for alternatives are present suggesting the pathways are complete. For example, the step normally carried out by erythrose 4-phosphate dehydrogenase (Epd) in the pyridoxal 5′-phosphate pathway might be carried out by glyceraldehyde 3-phosphate dehydrogenase (GapA) as seen in some other species [30].

There are some compounds for which we could identify homologs of all known genes in biosynthetic pathways. However, some enzymes in these pathways are still unknown in any organism, and thus we could not identify them here. This is true for the pyrimidine phosphatase in the riboflavin pathway and the dihydroneopterin monophosphate dephosphorylase in the folic acid pathway. We believe it is likely that these pathways are complete in Baumannia and that, due to its ultracompact gene pool, Baumannia provides an ideal opportunity to identify the genes encoding the enzymes for these steps.

Perhaps most interesting are the pathways for which we could identify genes underlying many downstream steps but for which Baumannia would need to import some intermediates to feed those steps. For example, Baumannia encodes genes for the last three steps of siroheme synthesis, and the last step of the heme O pathway, but candidate genes underlying the upstream steps could not be identified. Thus, Baumannia needs to import porphobilinogen and protoheme as substrates for these incomplete pathways. This pattern is particularly apparent in that Baumannia appears to be able to synthesize many cofactors from amino acids but is unable to synthesize the amino acid precursors. Examples of such pathways and the amino acid required include thiamine (tyrosine), biotin (alanine), pyridine nucleotides (aspartate), and folate and pyridoxal 5′-phosphate (glutamine and glutamate). This suggests that Baumannia must import these amino acids. The lack of amino acid biosynthesis pathways also makes it a necessity for Baumannia to import 2-ketovaline as a precursor for the synthesis of pantothenate and coenzyme A.

Due to the diversity of vitamin and cofactor synthesis pathways that are present, we conclude that Baumannia is providing its host with these compounds due to their low abundance in its diet. In this respect Baumannia is more similar to Wigglesworthia, the symbiont of tsetse flies, than to Buchnera.

Amino Acid Biosynthetic Pathways Are Generally Absent from Baumannia and Likely Are Found in Another Organism in the System

In contrast to what is seen for vitamin and cofactor synthesis, Baumannia is predicted to encode a very limited set of amino acid synthesis pathways. The few capabilities that are present include histidine biosynthesis, synthesis of methionine if external homoserine is provided, and the ability to make chorismate but not to use it as substrate for production of aromatic amino acids as in most bacterial species. Except for histidine, no complete pathways for the synthesis of any amino acids essential to the host are present.

The lack of amino acid synthesis pathways is apparently compensated by an ability to import amino acids from the environment using a general amino acid ABC transporter, an arginine/lysine ABC transporter, a lysine permease, and a proton/sodium-glutamate symport protein, although the gene for the latter is disrupted by one frameshift. The import of amino acids is apparently used not just for making proteins but also for energy metabolism. The latter is evident by the presence of the aspartate ammonia-lyase AspA, which could be used to convert L-aspartate to fumarate, which in turn can be fed into the tricarboxylic acid cycle.

The absence of essential amino acid synthesis pathways from Baumannia implies that both the host and Baumannia must obtain amino acids from some external source or sources. The sole diet of H. coagulata is xylem sap [10], in which essential amino acids are rare to absent; however, a substantial portion of the nitrogen in xylem occurs in the form of certain nonessential amino acids, including glutamine, aspartic acid, and asparagine (e.g., [11,14,31,32]). The essential amino acid synthesis pathways have not been found in any animal species studied to date, and nutritional studies in insects indicate that these compounds are required nutrients in insects as in mammals. Thus, the most plausible alternative is that another organism that is reliably present in the “ecosystem” of the host body is synthesizing the missing amino acids.

Analysis of Leftover Shotgun Sequence Reads Reveals the Presence of Amino Acid Synthesis Genes in Organisms Other than Baumannia

The most likely candidate for another organismal source of the amino acid synthesis pathways is Sulcia, the other coevolving symbiont found in bacteriomes mentioned above. Although we did not set out to sequence the Sulcia genome as part of this project, we realized we might have inadvertently acquired some of it since many sequence reads from the shotgun sequencing did not assemble with the Baumannia genome. These reads derived from cells of other organisms that were present in the tissue samples we used to isolate DNA for the Baumannia sequencing. An initial search of these sequence reads revealed the presence of homologs of genes with roles in the synthesis of essential amino acids. However, we could not conclude that these reads were from Sulcia, since there could have been cells of other organisms in the sample as well. To sort the extra reads into taxonomic bins, we adapted methods we have used to sort sequences from environmental shotgun sequencing projects (see Materials and Methods) and were able to assign non-Baumannia sequences to three main groups: host, Wolbachia related, and Sulcia related.

The finding of some Wolbachia in the sample was not surprising since rRNA surveys have shown that these alphaproteobacterial relatives of Rickettsia are found in many sharpshooters including H. coagulata. We note that we did not detect any sequences from the previously sequenced phytopathogen X. fastidiosa, which colonizes the surface of the foregut and is not present in the bacteriomes that we used for DNA isolation. In addition, although some of our sequences show high identity to sequences annotated as being from a phytoplasma, we believe this annotation is incorrect. The “phytoplasma” DNA was isolated from the saliva of the leafhopper Orosius albicinctus [33]. However, all the sequences in our sample that showed matches to sequences annotated as “phytoplasma”-like show phylogenetic relationships to the Bacteroidetes phylum. In addition, Sulcia is known to be a symbiont of species in the Deltocephalinae, the leafhopper subfamily containing O. albicinctus [9]. Thus, the putative “phytoplasma”-like sequences with matches in our sample are likely from the Sulcia symbiont of O. albicinctus. Why these sequences appeared in samples from salivary secretions is unclear.

Amino Acid Synthesis Pathways Are in Sulcia and Not Other Organisms in the Sample

Of the essential amino acid synthesis genes identified in the extra shotgun sequence reads, the vast majority (31 of 32) were assigned to the Sulcia bin. In contrast, only one gene (argB) was found in the Wolbachia bin and none were found in the host bin. We therefore sought to obtain as much sequence information as possible from the Sulcia symbionts in this system. First, we completed the sequence of all plasmid clones for which at least one read had been assigned to the Sulcia bin. In addition, we constructed a new library from tissue thought to contain more of the Sulcia symbiont than the library used for the initial sequencing. End-sequencing of this library identified some additional Sulcia-derived clones, and these, too, were completely sequenced. After conducting another round of assembly and assigning contigs and sequences to taxonomic bins, we were able to assign 146,384 bp of unique sequence to Sulcia. In these data, we identified 166 protein-coding genes. A phylogenetic analysis of a concatenated alignment of ribosomal proteins groups this protein set within the Bacteroidetes, thus supporting our assignment of these sequences to Sulcia (Figure 5).

Figure 5. Maximum-Likelihood Tree of Sulcia with Species in the Bacteroidetes and Chlorobi Phyla for which Complete Genomes Are Available

The tree was built using the PHYML program from the concatenated alignments of 34 ribosomal proteins. The bootstrap values are based upon 1,000 replications.

Although theoretically we could obtain a complete genome sequence of Sulcia by very deep sequencing of the samples we have obtained, this was not practical given limited funds. Nevertheless, analysis of the incomplete genome is quite revealing. First, among the 166 predicted proteins are 31 that underlie steps or whole pathways for the synthesis of amino acids essential for the host (Figure 4). These include the complete pathway of threonine biosynthesis and nearly complete pathways for the synthesis of leucine, valine, and isoleucine (the only gene not sampled is ilvE encoding the branched chain amino acid aminotransferase). In addition, multiple genes in the pathways for the synthesis of lysine, arginine, and tryptophan are present. We believe it is likely that these pathways are present and that the missing genes are in the unsequenced parts of the genome.

One question that remains is where Sulcia gets all of the nitrogen for these amino acids. One possibility is that it acquires and then converts nitrogenous organic compounds, particularly the nonessential amino acids known to be present in xylem (e.g., [14,32]). Alternatively, it is possible that Sulcia assimilates nitrogen from compounds such as ureides or ammonium, which are found in xylem (e.g., [14,32,34]). It has been proposed that X. fastidiosa, the plant pathogen vectored by H. coagulata, makes use of the ammonium in xylem as a nitrogen source [35]. Alternatively, Sulcia could garner inorganic nitrogen from the host, for which ammonium is a waste product [10,13]. Host waste is apparently a source of nitrogen for Blattabacterium, close relatives of Sulcia that are symbionts of cockroaches [36]. Although some insect genomes encode enzymes that may allow for this (e.g., glutamine synthetase or glutamate synthase; e.g., [37]), it is not yet known whether these capabilities are present in sharpshooters. Whatever the source of its nitrogen, the genome analysis indicates that Sulcia apparently can make the amino acids required by the host.

The other abundant organism in our DNA was Wolbachia, an unlikely candidate as the source of these compounds. Wolbachia cannot be an obligate symbiont of sharpshooters because it infects only some individuals. Screening of individual H. coagulata indicates that some do not contain Wolbachia ([16], two of 40 insects were uninfected in our screens); and screening of individuals of the closely related species, Homalodisca literata (a synonym of H. lacerta), revealed no cases of Wolbachia infection. Also, although we have sampled only a fraction of the Wolbachia genome, the absence of amino acid synthesis pathways is consistent with the complete lack of essential amino acid biosynthesis in any of several sequenced Wolbachia genomes (two complete and many incomplete) [23,38,39].

We therefore conclude that Sulcia is most likely the sole provider of essential amino acids for H. coagulata. Thus, this member of the Bacteroidetes phylum appears to function in a similar way to Buchnera and Blochmannia species in the Gammaproteobacteria.

Sulcia and Baumannia Complement Each Other

We found very few genes in the partial Sulcia genome for vitamin or cofactor synthesis. Since the Sulcia genome appears to be quite small and we have apparently sampled a large fraction of it, we can speculate that few such genes are likely to be present. Thus, in the 146 kb of sequence assigned to Sulcia, we have already found many of the core housekeeping types of genes (e.g., 40 ribosomal proteins and ten tRNA synthetases) (Figure 6, Table S2). A very small genome size is consistent with phylogenetic reconstructions indicating that Sulcia is an extremely old symbiont, originating in the Permian [9].

Figure 6. The Distribution into Functional Role Categories of the 166 Predicted Genes Encoded in the 146,384-bp Partial Sequence of the Sulcia Genome

Data are shown for all ORFs that encode proteins longer than 45 amino acids that have BLASTP matches with an E-value less than 10⁻³ to proteins in complete genomes. Different fragments of the same gene are counted as one gene in the chart.

The paucity of vitamin and cofactor synthesis pathways in Sulcia suggests the possibility that Sulcia and Baumannia play complementary, nonoverlapping roles in this symbiotic system. Not only do they appear to provide different resources for the host (Sulcia provides the amino acids and Baumannia the vitamins and cofactors) but, based on the current evidence, each does not provide the resources made by the other (Table 4). Indeed, the single essential amino acid biosynthetic pathway present in the Baumannia genome, that for histidine, is correspondingly the sole essential amino acid pathway with multiple steps for which no genes were detected in Sulcia. Thus, although Baumannia and the host apparently depend on Sulcia for the majority of essential amino acids, Sulcia and the host may depend on Baumannia for histidine. The complementarity between host and each symbiont extends to mutual dependence between the symbionts, which appear to depend on each other for these required compounds and for intermediates in other metabolic processes. For example, we predict that Sulcia can make homoserine, which, as discussed above, could be the substrate for methionine synthesis in Baumannia. In addition, the valine pathway in Sulcia could be the source of the 2-ketovaline for pantothenate and coenzyme A biosynthesis in Baumannia. Exchange of intermediates may be occurring for many aspects of metabolism. In the case of ubiquinone, a key component of the electron transport chain, Baumannia lacks genes encoding the needed biosynthetic enzymes and thus likely needs to import ubiquinone. The same appears to be true for menaquinone. Strikingly, even though only four of the 166 proteins in Sulcia are predicted to be involved in pathways of cofactor synthesis, two are for the production of menaquinone and ubiquinone, which are among the few cofactors whose synthesis is not carried out by Baumannia.

Table 4. The Complementarity of Amino Acid Biosynthesis and Cofactor Biosynthesis Pathways between Baumannia and Sulcia

The coresidence of Sulcia and Baumannia, presented here from H. coagulata, is representative of a symbiotic pair that is distributed in most or all sharpshooters, a xylem-feeding insect group [9,16]. Thus, the possibility of metabolic complementarity that is suggested by the genome analyses reflects long coevolution of the three lineages represented by the insects and the two bacteria. The two symbionts occur in close proximity within the yellow portion of the host bacteriomes [16], and Baumannia cells often appear to adhere to the surface of the much larger Sulcia cells. This arrangement is illustrated in images from our in situ hybridizations for H. literata, a close relative of H. coagulata (Figure 7).

Figure 7. Baumannia and Sulcia Coinhabit the Bacteriomes of the Host Insects

Fluorescent in situ hybridizations were performed using oligonucleotide probes designed to hybridize selectively to the ribosomal RNA of Baumannia (green) and of Sulcia (red), respectively. Bacteriomes were obtained from Homalodisca literata (a very close relative of H. coagulata).

Conclusions

The glassy-winged sharpshooter, H. coagulata, feeds on xylem sap, which has very low levels of many nutrients required by insects and other animals [10]. Sequence analysis suggests the occurrence of an obligate symbiosis among three organisms: H. coagulata, the gamma-proteobacterial endosymbiont Baumannia, and the Bacteroidetes bacterial symbiont Sulcia. The two bacterial symbionts co-occur within the cytosol of sharpshooter bacteriocytes, sometimes residing within the same cells. The main function of Baumannia, as revealed by its genomic sequence, is to provide cofactors, especially water-soluble B-family vitamins, to the host. Partial sequences from Sulcia suggest that it provides essential amino acids to the host. The two endosymbionts appear to show functional complementarity and show little overlap in biosynthetic pathways, although full sequencing of the Sulcia genome is needed for a comprehensive view of the contributions of these two organisms. Our analysis shows the added insight possible from assigning sequences to organisms rather than treating environmental samples as a representative of a communal gene set.

Many questions remain regarding this fusion of separate lineages into a single metabolic system. For example, the different organisms must balance their contributions to the shared metabolism through coordinated growth and gene expression, and the mechanisms underlying this integration are not known. Also, these bacterial genomes have undergone major reduction in size while apparently maintaining their complementary capabilities, raising the question of how the steps in genome reduction have been coordinated. The sharpshooters and their obligate bacterial endosymbionts provide a simple model of genomic coevolution, a process that has likely been central in the evolution of most organisms living in stable associations.

Materials and Methods

Isolation of DNA for sequencing.

The material for sequencing was obtained from adults of H. coagulata collected in a lemon orchard in Riverside, California, in June 2001 and June 2004. The California population was introduced from the southeastern United States, Texas, or Mexico within the past 20 years [10,27]. DNA was isolated by first dissecting out the red portion of the bacteriome, which contains mainly Baumannia [16]. Approximately 200 adults were dissected, in PA buffer, and kept on ice. Immediately following the dissection, the bacteriome samples were disrupted with a pestle and were passaged in PA buffer through a 20-μm filter and then through an 11-μm filter, on ice. The filtering was intended to remove nuclei of the host insect cells. DNA was isolated from the filtered material using standard methods [16]. For the second sample, adults were collected in 2004 from the same lemon orchard as before; in this case, we attempted to increase representation of the Sulcia genome by dissecting out the yellow portion of the bacteriome from approximately 200 adults and then processing as for the first sample.

Library construction and shotgun sequencing.

DNA libraries were constructed by shearing the genomic DNA through nebulization, cutting DNA of a particular size out of an agarose gel, and cloning it into the pHOS2 plasmid vector. Then 13,926 sequencing reads were generated from a 3- to 4-kb insert-sized library that was constructed using the first “red bacteriome” DNA sample. In addition, a large insert library (10- to 12-kb inserts) was constructed with DNA purified from the second “yellow bacteriome” DNA sample and 3,396 reads were generated from this library. In order to get more sequences to close the Baumannia genome and finish Sulcia clones, 2,986 sequencing reads were generated in the closure efforts.

Assembly and closure of the Baumannia genome.

The shotgun sequence data were assembled using the TIGR assembler [40], and the genome of Baumannia was closed using a combination of primer walking, multiplex PCR, and generation and sequencing of transposon-tagged libraries. Repeats were identified using RepeatFinder [41], and sequence and assembly of the repeats were confirmed using PCRs that spanned the repeat. The final assembly was checked such that every single base is covered by at least two clones and has been sequenced at least once in each direction. The average depth of coverage for the genome is 6.4. A putative origin of replication was identified by analysis of transitions in oligonucleotide skew [42].
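The idea of locating a replication origin from a skew transition can be illustrated with a much simpler statistic than the oligonucleotide skew method cited above. The sketch below uses cumulative single-nucleotide GC skew as a stand-in; it is not the method of [42], and the function names and the window size are assumptions made for illustration. In many bacterial genomes the cumulative skew curve reaches its minimum near the origin.

```python
from typing import List


def cumulative_gc_skew(seq: str, window: int = 1000) -> List[float]:
    """Cumulative (G - C) / (G + C) over consecutive non-overlapping windows."""
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        if g + c:                      # skip windows with no G or C
            total += (g - c) / (g + c)
        curve.append(total)
    return curve


def skew_minimum(seq: str, window: int = 1000) -> int:
    """0-based start coordinate of the window where the curve is lowest."""
    curve = cumulative_gc_skew(seq, window)
    if not curve:
        raise ValueError("empty sequence")
    return curve.index(min(curve)) * window
```

The returned coordinate is only a candidate region; on a circular genome the sequence start is arbitrary, so the curve should be read as a cycle.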

Identification and sequence of fragments of the genome of Sulcia.

Sequence reads from the shotgun sequencing data that did not map to the Baumannia genome were processed to sort them into candidate taxonomic groups (bins). First, they were assembled into contigs (although the vast majority of sequences did not assemble). Then each contig was analyzed to assign it to a putative bin using a combination of BLAST searches and phylogenetic trees. All sequences were searched with BLASTN and BLASTX against multiple sequence databases to identify top scoring matches. In addition, the BLASTX search results were used to identify possible proteins encoded in the sequences; these proteins were then used to build phylogenetic trees. The taxonomic identity of the nearest neighbor in these trees was extracted and stored. From these search results, sequences were assigned to taxonomic groupings of as low a taxonomic level as possible (e.g., if a protein grouped within a clade of sequences from insects, it was assigned to an insect bin). Examination of the results revealed that there were three major bins: insect, Wolbachia, and Bacteroidetes. There were also many sequences that were not readily assignable to one of these bins but could be assigned to higher level groups such as “Bacteria.” Based on rRNA studies and other work, we assumed that all sequences that were assigned to animals were likely from the host, and that all assigned to Bacteroidetes were likely from the Sulcia symbiont. Thus we refer to these bins as host and Sulcia, respectively.

Initial analysis indicated that there were some genes encoding proteins predicted to be involved in amino acid synthesis in the Sulcia bin. In order to get more data from this taxonomic group, we decided to finish sequencing any clones that mapped to this group and that were at the end of contigs. In order to reduce the probability of wasting funds sequencing clones from another organism, we developed more stringent criteria for selecting which of the initial Sulcia bin sequences to characterize further. In these criteria, at least one of the following had to be true: (1) part of the contig contained a match of greater than 99% identity to the previously sequenced 16S to 23S rDNA of Sulcia [16]; (2) BLASTP searches of translated sequences against all complete microbial genomes gave a best match (based on E-value) to a member of the Bacteroidetes phylum; (3) predicted proteins branched with genes from Bacteroidetes species in neighbor joining trees; and (4) the sequences were significantly AT biased. For all sequence reads that were assigned to Sulcia using these criteria, if they were at the end of a contig, the remainder of the clone was sequenced.

After this additional sequencing, all sequence reads (including the new reads) that did not map to the Baumannia genome were reassembled using the Celera Assembler. From this new assembly, a “final” list of contigs likely to be from Sulcia was identified using criteria similar to those above: first, the fragment had to have ORFs that either had a best scoring BLAST hit to a sequence in the Bacteroidetes phylum or position next to a Bacteroidetes gene in neighbor-joining trees of the proteins identified by BLASTP. In addition, GC content had to either be below 40% or the fragment had to have greater than 99% match for at least 200 bases to the previously sequenced 16S rDNA of the Bacteroidetes endosymbiont of H. coagulata. The low GC content criterion was applied to exclude contamination from free-living bacteria in our DNA sample.
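The final selection rule for Sulcia contigs can be written as a small predicate. This is a minimal sketch rather than the pipeline actually used: the `Contig` fields and the function names are hypothetical, and it assumes the BLAST and neighbor-joining results have already been summarized per contig.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Per-contig summary; all field names are hypothetical."""
    sequence: str
    best_blast_hit_is_bacteroidetes: bool   # top-scoring BLAST hit of an ORF
    nj_neighbor_is_bacteroidetes: bool      # nearest neighbor in an NJ tree
    rdna_identity: float = 0.0              # fraction identity to Sulcia 16S rDNA
    rdna_match_length: int = 0              # aligned bases in that match


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def is_sulcia_contig(c: Contig) -> bool:
    """Phylogenetic signal AND (GC below 40% OR a long, near-exact rDNA match)."""
    phylo_ok = c.best_blast_hit_is_bacteroidetes or c.nj_neighbor_is_bacteroidetes
    low_gc = gc_fraction(c.sequence) < 0.40
    rdna_ok = c.rdna_identity > 0.99 and c.rdna_match_length >= 200
    return phylo_ok and (low_gc or rdna_ok)
```

Writing the rule this way makes the logic explicit: a Bacteroidetes-like hit alone is not enough, since a GC-rich contig without the rDNA match is rejected as a possible free-living contaminant.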

From the new assemblies, contigs were also reassigned to the Wolbachia and host bins. To be considered to be from Wolbachia, the contig had to have not been assigned to Baumannia or Sulcia and had to have a top BLASTX hit to sequences from other Wolbachia. In total, 43,079 bp of unique sequence were assigned to Wolbachia. Another 120 kb worth of sequences and assemblies could not be assigned conclusively to Sulcia, Wolbachia, or Baumannia but had top BLAST hits to bacterial genes.

Genome annotation.

For the Baumannia genome, the GLIMMER program was used to identify putative CDSs [43]. Some putative CDSs were discarded if they had no significant sequence similarity to known genes and if they had significant overlaps with other CDSs with significant sequence similarity to known genes. Noncoding RNAs were identified as described previously [23]. Gene function annotation was based on results of BLASTP searches against Genpept and completed microbial genomes, and on hidden Markov model searches of the PFAM and TIGRFAM databases [44,45]. We identified only four genes in the Baumannia genome that did not have BLASTP matches to any protein entries in Genpept or proteins from publicly available complete genomic sequences (using an E-value cutoff of 0.01). GC skew and nucleotide composition analysis were performed as described previously [23].

For the partial Sulcia genome, ORFs were identified using the EMBOSS package [46]. Only those predicted peptides that were larger than 45 amino acids in length and that had BLASTP hits against microbial genome databases at E-value cutoff of 0.001 were kept as potential genes. The functional annotation of the Sulcia genes is mostly based on the top BLASTP hits.
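The two filters applied to the Sulcia ORFs amount to a length check and an E-value check. A minimal sketch follows; the function name and the convention of passing `None` when BLASTP returns no hit are assumptions made for illustration.

```python
from typing import Optional


def keep_orf(peptide: str, best_evalue: Optional[float],
             min_length: int = 45, evalue_cutoff: float = 1e-3) -> bool:
    """Return True if a predicted peptide passes both filters.

    best_evalue is the lowest BLASTP E-value for the peptide, or None
    when the search returned no hit at all.
    """
    if len(peptide) <= min_length:      # must be longer than 45 residues
        return False
    if best_evalue is None:             # no database match
        return False
    return best_evalue < evalue_cutoff
```

For example, `keep_orf("M" * 46, 1e-10)` passes, while a 45-residue peptide with the same hit does not.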

DNA polymorphisms in Baumannia.

Polymorphism analysis was done on the results of the initial assemblies of the shotgun sequence data. Finished sequences were not used since these were based in part on targeted sequencing of select clones, which eliminates the random nature found in the shotgun data. SNPs and indels were identified using stringent criteria to identify regions with variation among sequence reads that were not likely due to sequencing errors.

A site was considered to have an SNP if (1) it had high sequence quality (≥40 PHRED score); (2) the assembly column in which it was found had more than 4-fold coverage; (3) it had differences among the reads at that position; and (4) the variable site was adjacent to at least three invariant positions on both sides. We used only positions that did not have variable flanking sites to prevent alignment errors from mistakenly causing us to score a site as polymorphic. SNPs in coding regions were characterized as synonymous (no amino acid change), conservative (common amino acid change), nonconservative (unusual amino acid change), or nonsense (stop codon), with a BLOSUM80 matrix being used to distinguish conservative from nonconservative.
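The four SNP criteria can be sketched as a check on one column of an assembly. This is an illustration under stated assumptions, not the code used in the study: a column is modeled as a list of (base, PHRED) pairs, only base calls with PHRED ≥ 40 are compared when testing for differences, and a flanking column counts as invariant when its high-quality calls agree. All names are hypothetical.

```python
from typing import List, Tuple

Column = List[Tuple[str, int]]   # (base call, PHRED score) for each read


def is_variable(column: Column, min_phred: int = 40) -> bool:
    """True if the high-quality base calls in the column disagree."""
    good = {base for base, q in column if q >= min_phred}
    return len(good) > 1


def is_snp(columns: List[Column], i: int,
           min_phred: int = 40, min_coverage: int = 4, flank: int = 3) -> bool:
    """Apply the four criteria to column i of an assembly."""
    col = columns[i]
    if len(col) <= min_coverage:                 # need more than 4-fold coverage
        return False
    if not is_variable(col, min_phred):          # high-quality reads must differ
        return False
    if i < flank or i + flank >= len(columns):   # not enough flanking columns
        return False
    neighbors = columns[i - flank:i] + columns[i + 1:i + flank + 1]
    return not any(is_variable(c, min_phred) for c in neighbors)
```

Requiring invariant flanks on both sides is what guards against misalignments, which tend to produce runs of adjacent variable columns rather than isolated ones.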

Alignment gaps were scored as INDELs only if (1) the column with the gap had at least 4-fold coverage; (2) the aligned column had at least two high-quality sequence reads (≥40 PHRED score), and (3) three consecutive sequence reads on both sides of the gap(s) were of high quality (≥40 PHRED score).

To determine whether the polymorphisms occurred within or between individual host insects, DNA was extracted from the bacteriomes of 40 individual H. coagulata. These individuals were from the same collection that was used for the genomic sequences and had been frozen at −80 °C at the time of collection. PCR primers were designed for two regions (554 bp and 725 bp) that contained SNPs. These regions were amplified, the reaction products cleaned with Qiagen (Valencia, California, United States) miniprep columns, and the products were sequenced directly in both directions at the University of Arizona Genomic Analysis and Technology Center using an ABI 3730 sequencing machine.

We also used these 40 individuals to determine whether Wolbachia, which was detected in our sequence dataset, was present in all insects in the population. This determination was made on the basis of diagnostic PCR based on two genes, 16S rRNA and wsp, with the Baumannia SNP loci described above used as controls for DNA quality. Individuals with products for both Wolbachia loci were scored as positive, and individuals lacking both were scored as negative. (No individuals yielded one product and not the other.)

Comparative genomics.

The predicted proteomes of Baumannia, Wigglesworthia, Blochmannia, and three strains of Buchnera were combined into one database. “All vs. all” BLASTP searches were performed for this database, and a Lek clustering algorithm was applied to cluster the peptides into gene families. An E-value cutoff of 1 × 10⁻⁴ for the BLASTP results and a Lek similarity cutoff of 0.6 were chosen for the gene family clustering [47]. All the genes were searched against PFAM and the TIGRFAM database by HMMER, as well as against the reference genomes of E. coli K12 and Yersinia pestis KIM by BLASTP. Gene families were curated and functional roles were assigned according to the HMM and BLASTP search results.

Whole genome alignments of Baumannia versus Wigglesworthia, Blochmannia, three strains of Buchnera, E. coli, and Yersinia pestis were performed. Genome alignments were built using the BLASTP-based Java program DAGCHAINER [48] with an E-value cutoff of 1 × 10⁻⁵.

Phylogenetic analysis.

A set of 45 ribosomal protein genes for which orthologs could be identified in Baumannia and other genomes of interest was selected. Each ortholog set was aligned using CLUSTALW, the alignments were concatenated, a maximum likelihood tree was built by PHYML, and 1,000 bootstrap replicates were performed [49]. The same approach was adapted for building the maximum likelihood tree from a set of 34 ribosomal protein genes for Sulcia and selected genomes of interest.
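The concatenation step can be sketched as follows, assuming each per-gene alignment is held as a mapping from taxon name to aligned sequence. The function name is hypothetical, and padding a missing taxon with gaps is an assumption of the sketch; the text above selects genes for which orthologs were identified, so padding may never be needed in practice.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments (taxon -> aligned sequence) end to end.

    Taxa missing from a gene are padded with gaps; this padding is an
    assumption of the sketch, not something stated in the text.
    """
    taxa = sorted({t for aln in alignments for t in aln})
    joined = {t: [] for t in taxa}
    for aln in alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for t in taxa:
            joined[t].append(aln.get(t, "-" * width))
    return {t: "".join(parts) for t, parts in joined.items()}
```

The resulting dictionary holds one long aligned sequence per taxon, which is the kind of input a tree-building program expects once written out in a suitable file format.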

Pathway analysis.

The proteomes from Baumannia and Sulcia were searched against KEGG GENES/SSDB/KO [50] databases by BLASTP. Neighbor-joining trees were built by QUICKTREE [51], and EC numbers were assigned to the Baumannia proteins based on the nearest neighbor in the phylogenetic trees. The list of the EC numbers present in the Baumannia genome was submitted to the KEGG Web site (http://www.genome.jp/kegg) to obtain all the potential pathways in the genome. Each pathway was examined and verified according to our genome annotations as well as the pathway descriptions in the EcoCyc database [52].

Fluorescent in situ hybridizations to visualize coresiding symbionts.

In order to obtain images of the symbionts and to verify the correspondence of 16S rDNA sequences to the organisms inhabiting bacteriomes, these structures were dissected from newly collected H. literata, a close relative of H. coagulata that occurs in Tucson, Arizona. (This procedure requires live material, and H. coagulata is a major pest that is not yet established in Arizona where this work was carried out.) Bacteriomes were disrupted in buffer, hybridized, and visualized as described in [9], except that mounts were made in antifading Vectashield medium (Vector Laboratories, Burlingame, California, United States), and the microscope and software used were Deltavision RT and SofWoRx V2.50 Suite V1.0 and Imaris V4.0 (Applied Precision, Issaquah, Washington, United States). The two oligonucleotide probes were specific to the homologous regions of the 16S rRNA and were labeled with different fluorescent dyes, enabling visualization of both symbionts within the same preparations.

Supporting Information

Table S1. Predicted Protein Coding Genes in the Baumannia Genome

Predicted functions and role categories are shown.

(659 KB DOC)

Table S2. Predicted Protein Coding Genes in the Sulcia Genome

Predicted functions and role categories are shown.

(206 KB DOC)

Accession Numbers

The genome sequence data have been submitted to multiple sequence databases. All sequence traces have been submitted to the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) Trace Archive and are available at ftp://ftp.ncbi.nih.gov/pub/TraceDB/baumannia_cicadellinicola. The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) closed, annotated genome accession number for Baumannia is CP000238 and the annotated data accession number for Sulcia is AANL00000000. The mapping of the traces to the closed genome of Baumannia is in the NCBI Assembly Archive (http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi) with number pending.

The Institute for Genomic Research (TIGR) accession numbers available in GenBank accession number CP000238 are EnzymeIIMan complex (BCI_0449–0451), phosphotransferase system Enzyme I (BCI_0070), histidyl phosphorylatable protein PtsH (BCI_0069), mqo type malate dehydrogenase (BCI_0001), NADH dehydrogenase I (BCI_0369–0381), cytochrome o oxidase (BCI_0267–0269), ATP synthase (BCI_0140–0147), glyceraldehyde 3-phosphate dehydrogenase (GapA) (BCI_0443), general amino acid ABC transporter (BCI_0250, BCI_0207–0208), arginine/lysine ABC transporter (BCI_0323–0326), lysine permease (BCI_0393), proton/sodium-glutamate symport protein (BCI_0108), and aspartate ammonia-lyase AspA (BCI_0593).

Acknowledgments

We are grateful to Heather Costa for assistance with sharpshooter collecting in Riverside and to Colin Dale and Wendy Smith for help with collections in 2001. Howard Ochman gave advice on the DNA isolation. We would like to acknowledge the TIGR Bioinformatics and IT departments for general support, Claire Fraser-Liggett and Eric Eisenstadt for encouragement, and members of the Eisen research group, especially Martin Wu and Jonathan Badger, for providing bioinformatics tools.

Author contributions. DW, NAM, and JAE conceived and designed the experiments. SEVA, GHP, KLW, HK, LJT, JMZ, HED, PLT, NAM, and JAE performed the experiments. DW, SCD, NAM, and JAE analyzed the data. NAM and JAE contributed reagents/materials/analysis tools. DW, NAM, and JAE wrote the paper. DW and SCD participated in annotation. SEVA participated in library construction: small insert. GHP participated in library construction: large insert. KLW and HK participated in Baumannia closure. LJT participated in Sulcia closure. JMZ participated in closure. HED, PLT, and NAM participated in DNA isolation. PLT and NAM participated in fluorescent in situ hybridization microscopy.

Competing interests. The authors have declared that no competing interests exist.

  1. Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H (2000) Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81–86.
  2. Tamas I, Klasson L, Canback B, Naslund AK, Eriksson AS, et al. (2002) 50 Million years of genomic stasis in endosymbiotic bacteria. Science 296:2376–2379.
  3. van Ham RC, Kamerbeek J, Palacios C, Rausell C, Abascal F, et al. (2003) Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci U S A 100:581–586.
  4. Russell JA, Latorre A, Sabater-Munoz B, Moya A, Moran NA (2003) Side-stepping secondary symbionts: Widespread horizontal transfer across and beyond the Aphidoidea. Mol Ecol 12:1061–1075.
  5. Buchner P (1965) Endosymbiosis of animals with plant microorganisms. New York: John Wiley. 909 p.
  6. Kaiser B (1980) Licht- und elektronenmikroskopische untersuchung der Symbioten von Graphocephala coccinea Forstier (Homoptera: Jassidae). J Insect Morphol Embryol 9:79–88.
  7. von Dohlen CD, Kohler S, Alsop ST, McManus WR (2001) Mealybug beta-proteobacterial endosymbionts contain gamma-proteobacterial symbionts. Nature 412:433–436.
  8. Gomez-Valero LM, Soriano-Navarro V, Perez-Brocal A, Heddi A, Moya JM, et al. (2004) Coexistence of Wolbachia with Buchnera aphidicola and a secondary symbiont in the aphid Cinara cedri. J Bacteriol 186:6626–6633.
  9. Moran NA, Tran P, Gerardo NM (2005) Symbiosis and insect diversification: An ancient symbiont of sap-feeding insects from the bacterial phylum Bacteroidetes. Appl Environ Microbiol 71:8802–8810.
  10. Redak RA, Purcell AH, Lopes JR, Blua MJ, Mizell RF, et al. (2004) The biology of xylem fluid-feeding insect vectors of Xylella fastidiosa and their relation to disease epidemiology. Annu Rev Entomol 49:243–270.
  11. Andersen P, Brodbeck B, Mizell R (1989) Metabolism of amino acids, organic acids and sugars extracted from the xylem fluid of four host plants by adult Homalodisca coagulata. Entomol Exp Appl 50:149–59.
  12. Andersen PC, Brodbeck BV, Mizell RF (1992) Feeding by the leafhopper, Homalodisca coagulata, in relation to xylem fluid chemistry and tension. J Insect Physiol 38:611–622.
  13. Andersen PC, Brodbeck B, Mizell RF (1995) Diurnal variation in tension, osmolarity and the composition of nitrogen and carbon assimilates in xylem fluid of Prunus persica, Vitis hybrid and Prunus communis. J Am Hort Sci 120:600–604.
  14. Malaguti D, Millard P, Wendler R, Hepburn A, Tagliavini M (2001) Translocation of amino acids in the xylem of apple (Malus domestica Borkh.) trees in spring as a consequence of both N remobilization and root uptake. J Exp Bot 52:1665–1671.
  15. Schjoerring JK, Husted S, Mäck G, Mattsson M (2002) The regulation of ammonium translocation in plants. J Exp Bot 53:883–890.
  16. Moran NA, Dale C, Dunbar H, Smith WA, Ochman H (2003) Intracellular symbionts of sharpshooters (Insecta: Hemiptera: Cicadellinae) from a distinct clade with a small genome. Environ Microbiol 5:116–126.
  17. Lerat E, Daubin V, Moran NA (2003) From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-proteobacteria. PLoS Biol 1:e9 DOI: 10.1371/journal.pbio.0030316.
  18. Gil R, Silva FJ, Zientz E, Delmotte F, Gonzalez-Candelas F, et al. (2003) The genome sequence of Blochmannia floridanus: Comparative analysis of reduced genomes. Proc Natl Acad Sci U S A 100:9388–9393.
  19. Moran NA (1996) Accelerated evolution and Muller’s ratchet in endosymbiotic bacteria. Proc Natl Acad Sci U S A 93:2873–2878.
  20. Itoh T, Martin W, Nei M (2002) Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc Natl Acad Sci U S A 99:12944–12948.
  21. Herbeck JT, Funk DJ, Degnan PH, Wernegreen JJ (2003) A conservative test of genetic drift in the endosymbiotic bacterium Buchnera: Slightly deleterious mutations in the chaperonin groEL. Genetics 165:1651–1660.
  22. Rispe C, Delmotte F, van Ham RC, Moya A (2004) Mutational and selective pressures on codon and amino acid usage in Buchnera endosymbiotic bacteria of aphids. Genome Res 14:44–53.
  23. Wu M, Sun LV, Vamathevan J, Riegler M, Deboy R, et al. (2004) Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: A streamlined genome overrun by mobile genetic elements. PLoS Biol 2:e69 DOI: 10.1371/journal.pbio.0020069.
  24. Wernegreen JJ, Degnan PH, Lazarus AB, Palacios C, Bordenstein SR (2003) Genome evolution in an insect cell: Distinct features of an ant-bacterial partnership. Biol Bull 204:221–231.
  25. Akman L, Yamashita A, Watanabe H, Oshima K, Shiba T, et al. (2002) Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat Genet 32:402–407.
  26. Asai T, Sommer S, Bailone A, Kogoma T (1993) Homologous recombination-dependent initiation of DNA replication from DNA damage-inducible origins in Escherichia coli. EMBO J 12:3287–3295.
  27. Smith PT (2005) Mitochondrial DNA variation among populations of the glassy-winged sharpshooter, Homalodisca coagulata. J Insect Sci 5:41.
  28. Degnan PH, Lazarus AB, Wernegreen JJ (2005) Genome sequence of Blochmannia pennsylvanicus indicates parallel evolutionary trends among bacterial mutualists of insects. Genome Res 15:1023–1033.
  29. Miller JH (1996) Spontaneous mutators in bacteria: Insights into pathways of mutagenesis and repair. Annu Rev Microbiol 50:625–643.
  30. Yang Y, Zhao G, Man TK, Winkler ME (1998) Involvement of the gapA– and epd (gapB)-encoded dehydrogenases in pyridoxal 5′-phosphate coenzyme biosynthesis in Escherichia coli K-12. J Bacteriol 180:4294–4299.
  31. Brodbeck B, Mizell RF, Andersen P (1990) Amino acids as determinants of host preference for the xylem-feeding leafhopper, Homalodisca coagulata. Oecologia 83:338–345.
  32. Brodbeck BV, Andersen PC, Mizell RF (1999) Effects of total dietary nitrogen form on the development of xylophagous leafhoppers. Arch Insect Biochem Physiol 42:37–50.
  33. Melamed S, Tanne E, Ben-Haim R, Edelbaum O, Yogev D, et al. (2003) Identification and characterization of phytoplasmal genes, employing a novel method of isolating phytoplasmal genomic DNA. J Bacteriol 185:6513–6521.
  34. Suárez MF, Avila C, Gallardo F, Cantón R, Garcia-Gutiérrez A, et al. (2002) Molecular and enzymatic analysis of ammonium assimilation in woody plants. J Exp Bot 53:891–904.
  35. Simpson JG, Reinach FC, Arruda P, Abreu FA, Acencio M, et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa. Nature 406:151–157.
  36. Wren HN, Cochran DG (1987) Xanthine dehydrogenase activity in the cockroach endosymbiont Blattabacterium cuenoti (Mercier 1906) Hollande and Favre 1931 and in the cockroach fat body. Comp Biochem Physiol 88:1023–1026 B.
  37. Scaraffia PA, Isoe J, Murillo A, Wells MA (2005) Ammonia metabolism in Aedes aegypti. Insect Biochem Mol Biol 35:491–503.
  38. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, et al. (2005) The Wolbachia genome of Brugia malayi: Endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3:e121 DOI: 10.1371/journal.pbio.0030121.
  39. Salzberg SL, Hotopp JC, Delcher AL, Pop M, Smith DR, et al. (2005) Serendipitous discovery of Wolbachia genomes in multiple Drosophila species. Genome Biol 6(3):R23.
  40. Sutton GG, White O, Adams MD, Kerlavage AR (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci Technol 1:9–19.
  41. Volfovsky N, Haas BJ, Salzberg SL (2001) A clustering method for repeat analysis in DNA sequences. Genome Biol 2:research0027.1–27.11.
  42. Worning P, Jensen LJ, Hallin PF, Stærfeldt LJ, Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8:353–361.
  43. Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26:544–548.
  44. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, et al. (2004) The Pfam protein families database. Nucleic Acids Res 32:D138–D141.
  45. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31:371–373.
  46. Rice P, Longden I, Bleasby A (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet 16:276–277.
  47. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence of the human genome. Science 291:1304–1351.
  48. Haas BJ, Delcher AL, Wortman JR, Salzberg SL (2004) DAGchainer: A tool for mining segmental genome duplications and synteny. Bioinformatics 20:3643–3646.
  49. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704.
  50. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28:27–30.
  51. Howe K, Bateman A, Durbin R (2002) QuickTree: Building huge neighbour-joining trees of protein sequences. Bioinformatics 18:1546–1547.
  52. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, et al. (2005) EcoCyc: A comprehensive database resource for Escherichia coli. Nucleic Acids Res 33:D334–D337.

Annoying email from AN UNNAMED COMPANY and the growing amount of biotech SPAM

TO ALL

BECAUSE SOME PEOPLE WERE UPSET I NAMED THE COMPANY INVOLVED IN THIS SPAM AND BECAUSE THE COMPANY NAME IS NOT REALLY RELEVANT I HAVE REMOVED THE NAME FROM HERE AND WOULD BE HAPPY TO TELL ANYONE DETAILS OFFLINE.

Just got an annoying email from A NOW UNNAMED BIOTECH COMPANY advertising some new product they have that is completely useless to me. Basically, this was biotech SPAM. No way of unsubscribing. I never requested email from them. I do not think I have ever bought anything from them. And now an email from someone AT THIS UNNAMED COMPANY trying to sell something of no use to me. This is troubling in many ways – since it is happening with more and more biotechnology companies. Not only does it suggest they are getting desperate (really surprising from THIS COMPANY actually) but also that they are becoming more like marketers of cheap VIAGRA knockoffs than valuable assets to the world. So I have decided to out all such spam that I get here on my blog. THIS COMPANY – you are #1. Please stop what you are doing and stop wasting people’s time.

The power of Open Access II – Posting my open access papers in my blog

I am continuing with my trend of posting my Open Access papers here. Here is my Tetrahymena genome paper published in PLoS Biology with many others as Eisen JA, Coyne RS, Wu M, Wu D, Thiagarajan M, et al. (2006) Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote. PLoS Biol 4(9): e286 doi:10.1371/journal.pbio.0040286

Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote

Jonathan A. Eisen1¤a*, Robert S. Coyne1, Martin Wu1, Dongying Wu1, Mathangi Thiagarajan1, Jennifer R. Wortman1, Jonathan H. Badger1, Qinghu Ren1, Paolo Amedeo1, Kristie M. Jones1, Luke J. Tallon1, Arthur L. Delcher1¤b, Steven L. Salzberg1¤b, Joana C. Silva1, Brian J. Haas1, William H. Majoros1¤c, Maryam Farzad1¤d, Jane M. Carlton1¤e, Roger K. Smith Jr.1¤f, Jyoti Garg2, Ronald E. Pearlman2,3, Kathleen M. Karrer4, Lei Sun4, Gerard Manning5, Nels C. Elde6¤g, Aaron P. Turkewitz6, David J. Asai7, David E. Wilkes7, Yufeng Wang8, Hong Cai9, Kathleen Collins10, B. Andrew Stewart10, Suzanne R. Lee10, Katarzyna Wilamowska11, Zasha Weinberg11¤h, Walter L. Ruzzo11, Dorota Wloga12, Jacek Gaertig12, Joseph Frankel13, Che-Chia Tsao14, Martin A. Gorovsky14, Patrick J. Keeling15, Ross F. Waller15¤j, Nicola J. Patron15¤j, J. Michael Cherry16, Nicholas A. Stover16, Cynthia J. Krieger16, Christina del Toro17¤k, Hilary F. Ryder17¤l, Sondra C. Williamson17, Rebecca A. Barbeau17¤m, Eileen P. Hamilton17, Eduardo Orias17

1 The Institute for Genomic Research, Rockville, Maryland, United States of America, 2 Department of Biology, York University, Toronto, Ontario, Canada, 3 Centre for Research in Mass Spectrometry, York University, Toronto, Ontario, Canada, 4 Department of Biological Sciences, Marquette University, Milwaukee, Wisconsin, United States of America, 5 Razavi-Newman Center for Bioinformatics, The Salk Institute for Biological Studies, San Diego, California, United States of America, 6 Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, Illinois, United States of America, 7 Department of Biology, Harvey Mudd College, Claremont, California, United States of America, 8 Department of Biology, University of Texas at San Antonio, San Antonio, Texas, United States of America, 9 Department of Electrical Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America, 10 Department of Molecular and Cellular Biology, University of California Berkeley, Berkeley, California, United States of America, 11 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America, 12 Department of Cellular Biology, University of Georgia, Athens, Georgia, United States of America, 13 Department of Biological Sciences, University of Iowa, Iowa City, Iowa, United States of America, 14 Department of Biology, University of Rochester, Rochester, New York, United States of America, 15 Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada, 16 Department of Genetics, Stanford University, Stanford, California, United States of America, 17 Department of Molecular, Cellular, and Developmental Biology, University of California Santa Barbara, Santa Barbara, California, United States of America

The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. 
The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.

Funding. This project was supported by grants to JAE from the National Science Foundation Microbial Genome Sequencing Program (EF-0240361) and the National Institutes of Health–National Institute of General Medical Sciences (R01 GM067012–03). We also acknowledge Genome Canada for support of EST library construction and sequencing through the Protist EST Project and grant RR-009231 to EO from the National Institutes of Health (the National Center for Research Resources) which supported the RAPD and Cbs work and an EO subcontract to NSF grant MCB-0132675 which supported sequence analyses related to number of chromosomes and their copy number.

Competing interests. The authors have declared that no competing interests exist.

Academic Editor: Mikhail Gelfand, Institute for Information Transmission Problems, Russian Federation

Citation: Eisen JA, Coyne RS, Wu M, Wu D, Thiagarajan M, et al. (2006) Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote. PLoS Biol 4(9): e286 doi:10.1371/journal.pbio.0040286

Received: January 4, 2006; Accepted: June 23, 2006; Published: August 29, 2006

Copyright: © 2006 Eisen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: bp, base pairs; Cbs, chromosome breakage sequence; CM, covariance model; EST, expressed sequence tag; IES, internal eliminated sequence; ITR, inverted terminal repeat; MAC, macronucleus/macronuclear; MIC, micronucleus/micronuclear; ncRNA, noncoding RNA; RIP, repeat induced point mutation; SCI, single-cell isolation; Sec, selenocysteine; TE, transposable element; TGD, Tetrahymena Genome Database; TIGR, The Institute for Genomic Research; VIC, voltage-gated ion channel

* To whom correspondence should be addressed. E-mail: jaeisen@ucdavis.edu

¤a Current address: University of California Davis Genome Center, Section of Evolution and Ecology, School of Biological Sciences and Department of Medical Microbiology and Immunology, School of Medicine, University of California Davis, Davis, California, United States of America

¤b Current address: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America

¤c Current address: Duke Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America

¤d Current address: Agilent Technologies, Inc., Santa Clara, California, United States of America

¤e Current address: Department of Medical Parasitology, New York University School of Medicine, New York, New York, United States of America

¤f Current address: Dupont Agriculture and Nutrition, Wilmington, Delaware, United States of America

¤g Current address: Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

¤h Current address: Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, United States of America

¤j Current address: School of Botany, The University of Melbourne, Melbourne, Australia

¤k Current address: Meharry Medical College, Nashville, Tennessee, United States of America

¤l Current address: Dartmouth-Hitchcock Medical Center, Lebanon, New Hampshire, United States of America

¤m Current address: Lung Biology Center, University of California San Francisco, San Francisco, California, United States of America

Introduction

Tetrahymena thermophila is a single-celled model organism for unicellular eukaryotic biology [1]. Studies of T. thermophila (referred to as T. pyriformis variety 1 or syngen 1 prior to 1976 [2]) have contributed to fundamental biological discoveries such as catalytic RNA [3], telomeric repeats [4,5], telomerase [6], and the function of histone acetylation [7]. T. thermophila is advantageous as a model eukaryotic system because it grows rapidly to high density in a variety of media and conditions, its life cycle allows the use of conventional tools of genetic analysis, and molecular genetic tools for sequence-enabled experimental analysis of gene function have been developed [8,9]. In addition, although it is unicellular, it possesses many core processes conserved across a wide diversity of eukaryotes (including humans) that are not found in other single-celled model systems (e.g., the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe).

T. thermophila is a member of the phylum Ciliophora, which also includes the genera Paramecium, Oxytricha, and Ichthyophthirius. A cartoon showing the phylogenetic position of T. thermophila relative to other eukaryotes for which the genomes have been sequenced is shown in Figure 1. The ciliates are one of three major evolutionary lineages that make up the alveolates. The other two lineages are dinoflagellates and the exclusively parasitic apicomplexa, which includes the Plasmodium species that cause malaria. Although experimental tools are improving for the apicomplexa [10–12], they can still be challenging to work with, and in some situations T. thermophila can serve as a useful “distant cousin” model for this group [13].

Figure 1. Unrooted Consensus Phylogeny of Major Eukaryotic Lineages

Representative genera are shown for which whole genome sequence data are either in progress (marked with asterisks *) or available. The ciliates, dinoflagellates, and apicomplexans constitute the alveolates (lighter yellow box). Branch lengths do not correspond to phylogenetic distances. Adapted from the more detailed consensus in [197].

As is typical of ciliates, T. thermophila cells exhibit nuclear dimorphism [14]. Each cell has two nuclei, the micronucleus (MIC) and the macronucleus (MAC), containing distinct but closely related genomes. The MIC is diploid and contains five pairs of chromosomes. It is the germline, the store of genetic information for the progeny produced by conjugation in the sexual stage of the T. thermophila life cycle. Conjugation involves meiosis, fusion of haploid MIC gametes to produce a new zygotic MIC, and differentiation of new MACs from mitotic copies of the zygotic MIC (for details, see [15]). After formation of the MAC, cells reproduce asexually until the next sexual conjugation. During this asexual growth, all gene expression occurs in the MAC, which is thus considered the somatic nucleus.

The MAC genome derives from that of the MIC, but the two genomes are quite distinct. During MAC differentiation, several types of developmentally programmed DNA rearrangements occur [16,17] (Figure 2). One such rearrangement is the deletion of segments of the MIC genome known as internally eliminated sequences (IESs). It is estimated that approximately 6,000 IESs are removed, resulting in the MAC genome being an estimated 10% to 20% smaller than that of the MIC [18]. A key aspect of the process is the preferential removal of repetitive DNA, which results in 90% to 100% of MIC repeats being eliminated [19,20]. Thus the process can be considered analogous to and more extreme than other forms of repeat element silencing phenomena such as repeat-induced point mutation (RIP) in Neurospora and heterochromatin formation [21,22]. A second programmed DNA rearrangement is the site-specific fragmentation at each location of the 15–base pair (bp) chromosome breakage sequence (Cbs) [23–25]. During fragmentation, sections of the MIC genome containing each Cbs, as well as up to 30 bp on either side, are deleted [26]. Telomeres are then added to each new end [27], generating some 250 to 300 MAC chromosomes [28,29].

Figure 2. Relationship between MIC and MAC Chromosomes

The top horizontal bar shows a small portion of one of the five pairs of MIC chromosomes. MAC-destined sequences are shown in alternating shades of gray. MIC-specific IESs (internally eliminated sequences) are shown as blue rectangles, and sites of the 15-bp Cbs are shown as red bars (not to scale). Below the top bar are shown macronuclear chromosomes derived from the above region of the MIC by deletion of IESs, site-specific cleavage at Cbs sites, and amplification. Telomeres are added to the newly generated ends (green bars). Most of the MAC chromosomes are amplified to approximately 45 copies (only three shown). Through the process of phenotypic assortment, initially heterozygous loci generally become homozygous in each lineage within approximately 100 vegetative fissions. Polymorphisms located on the same MAC chromosome tend to co-assort.
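
The MIC-to-MAC processing described above (IES deletion, breakage at Cbs sites, telomere addition) can be pictured with a toy string model. This is only an illustrative sketch: the function name mic_to_mac, the coordinate format, and the single-letter stand-in for a telomere are all assumptions made here, and the model ignores the flanking bases (up to 30 bp) lost around each Cbs as well as chromosome amplification.

```python
def mic_to_mac(mic, ies_spans, cbs_spans, telomere="t"):
    """Toy model of MAC differentiation on a string.

    mic        -- MIC sequence as a string
    ies_spans  -- list of (start, end) half-open spans to delete (IESs)
    cbs_spans  -- list of (start, end) half-open spans where the chromosome breaks (Cbs)
    telomere   -- stand-in string added to every new end (hypothetical placeholder)
    """
    ies_positions = set()
    for start, end in ies_spans:
        ies_positions.update(range(start, end))

    cbs_positions = set()
    for start, end in cbs_spans:
        cbs_positions.update(range(start, end))

    chromosomes, current = [], []
    for i, base in enumerate(mic):
        if i in cbs_positions:
            # A Cbs is removed and splits the sequence into separate pieces.
            if current:
                chromosomes.append("".join(current))
                current = []
        elif i not in ies_positions:
            # MAC-destined sequence is kept; IES bases are skipped.
            current.append(base)
    if current:
        chromosomes.append("".join(current))

    # Telomeres are added to both ends of each resulting chromosome.
    return [telomere + chrom + telomere for chrom in chromosomes]


# One IES ("iii") and one Cbs ("XXX") in a short made-up MIC segment.
example = mic_to_mac("AAAAiiiCCCCXXXGGGG", ies_spans=[(4, 7)], cbs_spans=[(11, 14)])
print(example)
```

In this made-up example the IES is spliced out so that the flanking MAC-destined segments are joined, while the Cbs separates the sequence into two telomere-capped pieces.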

Another process that occurs during MAC differentiation is the amplification of the number of copies of the MAC chromosomes. The rDNA chromosome, which encodes the 5.8S, 17S, and 26S rRNAs, is maintained at an average of 9,000 copies per MAC [30]. Six other chromosomes that have been examined are each maintained at an average of 45 copies per MAC [31]. During asexual reproduction, the MAC divides amitotically, with apparently random distribution of chromosome copies that behave as if acentromeric. In contrast, MIC chromosomes are metacentric [32] and are distributed mitotically [33,34]. Parental MAC DNA is not transmitted to sexual progeny, although it does have an epigenetic influence on postzygotic MAC genome rearrangement, mediated by RNA interference [35].

The Tetrahymena research community has coordinated an effort to develop genomic tools for T. thermophila [9,36]. The MAC genome was selected for initial sequencing because it contains all the expressed genes and because the complexity of the assembly process was expected to be reduced due to the lower amounts of repetitive DNA. These advantages, however, are countered by some complexities not seen in other eukaryotic genome projects, including the presence of several hundred medium-sized to small chromosomes, the possibility of unequal copy number of at least some chromosomes, the existence of polymorphisms that are generated during MAC development, and the inability to completely separate the MIC from the MAC prior to DNA isolation.

We report here on the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila strain SB210, an inbred strain B derivative that has been extensively used for genetic mapping and for the isolation of mutants. We discuss how the complexities of sequencing the MAC were successfully addressed, as well as the biological and evolutionary implications of our analysis of the genome sequence.

Results/Discussion

Genome Assembly and General Chromosome Structure

Sequencing and assembly. Using physical isolation methods, MAC were purified from a culture of T. thermophila strain SB210 and used to create multiple differentially sized shotgun sequencing libraries (Table S1). Construction of large (greater than 10 kb) insert libraries was not successful—a common problem in working with AT-rich genomes. Approximately 1.2 million paired end sequences were generated from the libraries and assembled using the Celera Assembler [37]. In an initial assembly, the mitochondrial genome (mtDNA; which was present due to some contamination of the MAC preparation with mitochondria) and the highly amplified rDNA chromosome did not assemble well compared to the published sequences of these molecules [38,39]. This was probably because contigs from these molecules had higher depths of coverage than those from other chromosomes, which caused the Celera Assembler to treat them as repetitive DNA. Thus we divided sequence reads into three bins (mtDNA, rDNA, and bulk MAC DNA) and generated assemblies for each bin separately. This resulted in a moderate improvement, and the three separate assemblies were thus used for all subsequent analyses. Detailed sequence and assembly information is presented in Tables 1 and S2.

Table 1. Important Genome Statistics

The bulk MAC assembly contains 1,971 scaffolds (contigs that have been linked into larger pieces by mate pair information) with a total estimated span of 104.1 Mb. Perhaps most important, using a combination of computational and experimental identification of telomeres, we have found that many scaffold ends correspond to chromosome ends. One hundred twenty-five scaffolds, encompassing 44% of the assembled genome length, are telomere-capped at both ends and thus likely represent complete MAC chromosomes. One hundred twenty additional scaffolds, encompassing another 31% of the genome, are telomere-capped at one end (Tables 1 and S3).

Assembly accuracy and completeness. Overall, all analyses indicate that the bulk MAC assemblies are highly accurate. For example, all 75 MAC loci that are in distinct genetic co-assortment groups (and thus should be on different chromosomes [40]) map to different scaffolds, and all pairs of loci that coassort (and thus should be on the same chromosome) either map to the same scaffold or to two non–fully capped scaffolds whose cumulative size is less than that of the corresponding MAC chromosome (Table S4). For the 24 completely assembled chromosome scaffolds for which we know the corresponding chromosome physical size, there is a very strong correlation between physical size and assembly length. In addition, there are no cases where a scaffold is significantly longer than the physical size of the corresponding chromosome (Figure 3A). Finally, all of the 96 MIC sequences known to be adjacent to Cbs sites [24,41,42] that matched to a MAC scaffold did so only at the scaffold’s end.

Figure 3. Scaffold Sizes

(A) Scaffold sizes versus MAC chromosome size. Blue diamonds represent scaffolds capped by telomeres on both ends. Red squares and green triangles represent incomplete scaffolds capped by telomeres at one or neither end, respectively.

(B) Size distribution of scaffolds capped by telomeres on both ends.

The general accuracy of the assemblies indicates that many of the potential difficulties discussed in the Introduction were not significant. For example, we see little evidence for polymorphism among reads, which is likely a reflection of the use of an inbred strain and the process of phenotypic assortment, which leads to whole-genome MAC homozygous lineages [43]. Also, searches for known MIC-specific sequences indicate that the amount of MIC contamination is very low (e.g., Cbs junctions are at 0.044× coverage which is approximately 200-fold less than the bulk MAC chromosomes) and limited to small contigs (most less than 5 kb). The uniform depth of contig coverage and accuracy of assemblies also suggest that the chromosomes are present in roughly similar copy number and that only limited amounts of repetitive DNA are present in the MAC, both of which are discussed further below.

The total scaffold length is much smaller than the predicted genome size of 180 to 200 Mb [14]. Given the accuracy of the assemblies, the large number of chromosomes partially or completely capped, and the fact that all (more than 200) known MAC DNA sequences are found in the assemblies, we conclude that the assemblies represent a very large (more than 95%) fraction of the genome. We conclude therefore that previous genome size estimates were inaccurate (which is not surprising given that they were made almost 30 years ago) and that the genome is close to 105 Mb in size. It is possible, however, that some chromosomes or regions were underrepresented in our libraries due to purification or cloning bias, and thus one cannot infer the absence of any particular gene or feature simply due to its absence from our current assemblies.

Estimating the number of MAC chromosomes. The total number of MAC chromosomes is unknown. The telomere-capping of scaffolds allows us to place a minimum boundary on this number at 185 (125 plus half of 120). One way of estimating the actual number is through analysis of the non–rDNA telomere-containing reads; 3,328 such reads can be linked to a total of 370 scaffold ends. This corresponds to approximately 9-fold coverage (3,328/370), which is not significantly different from the bulk MAC chromosome coverage of 9.08, indicating that there is no significant underrepresentation of telomere reads (Tables 1 and S3). Thus since there are 4,058 such reads total (the others could not be linked), we estimate that there are approximately 451 telomere ends (4,058/9), and thus that there are approximately 225 chromosomes (451/2). An independent estimate of the actual chromosome number can be made by assuming that the size distribution of fully capped chromosomes (see Figure 3B) is representative of the genome as a whole. Since these 125 capped chromosomes represent 43.5% of the total assembly length, this would predict 287 chromosomes in total (125/0.435). This is likely to be an overestimate, since larger chromosomes are statistically less likely to be in the completely assembled set. Indeed, the average size of completely assembled chromosomes is 359 kb, whereas estimates of the average MAC chromosome size obtained through pulsed-field gel electrophoresis are substantially higher [29,41]. Thus, we conclude that there are between 185 and 287 chromosomes, most likely somewhere near 225.
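
The arithmetic behind these three figures can be reproduced with a short script. This is an illustrative sketch only: the function and parameter names are hypothetical, the default values are the counts quoted in the paragraph above, and the unrounded telomere coverage is carried through the calculation rather than the rounded value of 9.

```python
def estimate_chromosome_count(
    fully_capped=125,            # scaffolds with telomeres at both ends
    singly_capped=120,           # scaffolds with a telomere at one end
    linked_telomere_reads=3328,  # non-rDNA telomere reads linked to scaffold ends
    linked_scaffold_ends=370,    # scaffold ends those reads were linked to
    total_telomere_reads=4058,   # all non-rDNA telomere-containing reads
    capped_fraction=0.435,       # share of assembly length in fully capped scaffolds
):
    """Return (lower_bound, read_based, size_based) chromosome-number estimates."""
    # Lower bound: every fully capped scaffold is one chromosome, and two
    # singly capped scaffolds could at best be halves of one chromosome.
    lower_bound = fully_capped + singly_capped // 2

    # Read-based estimate: coverage per telomere end, then total ends,
    # then two ends per chromosome.
    reads_per_end = linked_telomere_reads / linked_scaffold_ends
    telomere_ends = total_telomere_reads / reads_per_end
    read_based = telomere_ends / 2

    # Size-based estimate: assume fully capped chromosomes are representative
    # in size, so their share of the length equals their share of the count.
    size_based = fully_capped / capped_fraction

    return lower_bound, read_based, size_based


if __name__ == "__main__":
    lower, by_reads, by_size = estimate_chromosome_count()
    print(f"lower bound: {lower}")
    print(f"telomere-read estimate: {by_reads:.1f}")
    print(f"size-extrapolation estimate: {by_size:.1f}")
```

With the default values this gives a lower bound of 185, a read-based estimate of about 225.6 (225.5 when coverage is rounded to 9, as in the text), and a size-based estimate of about 287.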

Absence of many standard global features of eukaryotic chromosomes. We note that we searched for but could not find many of what are considered standard global features of eukaryotic chromosomes. For example, we could not find sequence or structural features shared across multiple chromosomes that could be considered candidates for centromeric regions. This is consistent with experimental studies [44]. In addition, although in many eukaryotes certain genes and repeat elements cluster near telomeres [45–51], we cannot detect any such clustering here. This is not because there is no variation in these features; for example, GC content (Figure S1) and gene density (Figure S2) do vary greatly. Instead, the absence of similar global structure between MAC chromosomes is likely due to the absence of the processes that help generate the key features of normal eukaryotic chromosomes (e.g., mitosis and meiosis, which in T. thermophila are confined to the MIC).

MAC chromosome copy number is uniform. The high quality and completeness of the assemblies suggest that copy number variation among at least most MAC chromosomes is relatively small since otherwise the assembler would have treated contigs from overrepresented chromosomes as repetitive DNA. Such uniform copy number is consistent with genetic experimental data for six chromosomes [31], but its generality for all chromosomes has been unknown. We realized that the relative chromosome copy number could be estimated from depth of coverage in our assemblies (assuming that cloning and sequencing success were relatively random). When all scaffolds are examined, the depth of coverage is remarkably uniform (Figure 4). The decrease in uniformity and coverage seen as scaffold size decreases is likely a reflection of both chance low coverage of some regions and some of the small scaffolds being MIC contaminants. When only scaffolds capped by telomeres at both ends are included in the analysis, observed sequence coverage is even more uniform (red diamonds in Figure 4). Although we cannot rule out that some smaller, incompletely assembled chromosomes are maintained at different copy numbers, the observed uniformity indicates that the replication and/or segregation of most or all bulk MAC chromosomes is under coordinated regulation.

Figure 4. Depth of Coverage versus Scaffold Size

Black diamonds indicate all scaffolds; red diamonds, scaffolds capped with telomeres on both ends.

General Features of Predicted Protein Coding Genes and Noncoding RNAs

Protein coding gene predictions. We identified 27,424 putative protein-coding genes in the genome (Table 2), a high number for a single-celled species. These gene models were tested by aligning expressed sequence tags (ESTs) to the genome assemblies using PASA [52]. We note that most of these ESTs were generated after the models were built (Table S5). Of the 9,122 EST clusters identified, most have either no conflicts with the gene models (49.5%) or relatively small ones (17.7% have a missed exon and 9.8% suggest the models need to be merged or split). Only 408 (4.4%) clusters are intergenic relative to the gene models. Although these could represent missed genes or gene regions, they could also be noncoding RNAs (ncRNAs) or genomic DNA contamination of cDNA libraries. In addition, the predicted and EST-derived introns are quite similar in size distribution except at the short and long extremes (Figure S3), GC content (16.3% versus 16.7%), and splice sites [only a small number (85) of EST-based introns have exceptions to the 5′-GT…AG-3′ junctions assumed by the model—these could simply be sequencing errors]. These analyses indicate that the gene models are relatively robust and should be more than sufficient for making general predictions about the coding potential of this species.

Table 2. Characteristics of Ab Initio Predicted Genes

Two other lines of evidence suggest the predicted gene number is not inflated. First, a large number of the predicted genes have matches to known or predicted genes from other species (14,916 have a BLASTP match with an E-value better than 10^-10), and second, experimental studies of mRNA complexity predict transcription of at least 25,000 genes of an average size of 1,200 bp [53]. We also note that the sequence of the largest MAC chromosome of another ciliate, Paramecium tetraurelia, indicates a high coding density, and extrapolation to the complete genome predicts at least 30,000 protein-coding genes [54].

ncRNAs and the use of all 64 codons to code for amino acids. The ncRNAs found in the genome are listed in Table S6. We call attention to a few new findings. Of the 174 putative 5S rRNA genes (Table S6A), 19 do not correspond to any of the four previously reported T. thermophila sequences [55,56]. These 19 differ from one another by single nucleotide substitutions at 34 positions, as well as by various insertions, deletions, and truncations and may represent pseudogenes. In addition, there are two forms of U2 snRNA present (Table S6C), which we have termed U2 (four genes) and U2var (five genes). Functional RNA gene families are expressed ubiquitously during the T. thermophila life cycle and under stress conditions as well (representative data shown in Figure S4). The largest class is tRNAs with 700 identified (Tables S6B and S6D), a number consistent with hybridization-based estimates [57].

One of the more unusual features of T. thermophila and certain other ciliates is the use of an alternative genetic code in which the canonical stop codons UAG and UAA code for glutamine [58]. The importance and age of this alternative code are reflected in the genome by the presence of 39 tRNAs for these codons. Remarkably, analysis of the genome has also revealed the presence of a tRNA that is predicted to decode the remaining stop codon, UGA. Multiple lines of evidence indicate that this is a functioning tRNA for selenocysteine (Sec), the so-called 21st amino acid. In those eukaryotic species that use Sec, most UGA codons still cause translation termination while those mRNAs that encode Sec-containing peptides have a characteristic stem-loop sequence motif in the 3′ UTR region that directs Sec incorporation [59,60]. The putative T. thermophila tRNA-Sec was identified by analysis of the genome sequence and shown to be transcribed and acylated [61], and we have found that it is expressed and charged and that its charging may be under distinct regulatory control from other tRNAs (Figure S4A). In addition, we identified six T. thermophila genes with in-frame UGA codons that align (after editing of the gene models) with known Sec codons of their homologs from other eukaryotic species and that have the stem-loop consensus and thus are likely to encode selenoproteins. Thus we conclude that UGA is almost certainly translated into Sec, which would make T. thermophila the first organism known to use all 64 triplet codons to specify amino acid incorporation.

Genome Evolution

Codon and amino acid usage bias. Although T. thermophila can use all 64 codons, it does not use all equally. The most significant aspect of the codon usage in this species is that the AT-rich codons tend to be used more frequently than others [62,63]. Thus although the AT bias in the genome is strongest in noncoding regions, where selection is thought to be relaxed, it is seen even in coding regions. In fact, the AT pull is so strong in coding regions that amino-acid composition of proteins is shifted toward those coded by codons with high AT content, as seen in other species with extreme AT bias (e.g., [64]). Although the overall codon usage is biased against GC-rich codons, on a gene-by-gene level there is significant variation in the degree of bias. We have identified two dominant patterns to this gene-by-gene variation. The major pattern is that for most genes, the codons used are simply a reflection of the overall AT content of the gene (Figure 5). The variation among genes is due to genomewide variation in AT content (see Figure 5A), although we have been unable to discern a mechanism underlying this variation (e.g., there is no clustering of high or low AT genes near telomeres). There is, however, a less common pattern in the gene-by-gene variation that is very important. There exists a subset of genes (shown in red) that use a common preferred codon set that is different from that of the average gene, and the codons in this set are not strongly correlated to the genes’ AT content. Although the existence of such a preferred codon set for this species has been reported [62,63], analysis of the genome allows the set and the genes that use it to be more precisely defined. In total, using a relatively conservative cutoff (Figure 5B), we have identified 232 such genes.

Figure 5. Codon Usage

(A) Effective number of codons (ENc; a measure of overall codon bias) for each predicted ORF is plotted versus GC3 (the fraction of codons that are synonymous at the third codon position that have either a guanine or a cytosine at that position). The upper limit of expected bias based on GC3 alone is represented by the black curve; most T. thermophila ORFs cluster below the curve [red dots as in (B)].

(B) Principal component analysis of relative synonymous codon usage in T. thermophila. The 232 genes in the tail of the comma-shaped distribution (those with the most biased codon usage) are colored red.

(C) Principal component analysis of relative synonymous codon usage in P. falciparum.
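
For readers who want to reproduce the axes of panel (A), the sketch below computes GC3 for an ORF and the expected ENc for a given GC3. It rests on two assumptions that the caption does not state: that the black curve follows Wright's (1990) expectation, ENc = 2 + s + 29/(s^2 + (1 - s)^2) with s = GC3, and that ATG, TGG, and TGA are the only codons without a synonymous third-position alternative under the ciliate code. The function names are illustrative.

```python
# Codons with no synonymous alternative at the third position under the
# ciliate nuclear code (assumption: Met, Trp, and the UGA stop/Sec codon).
NON_DEGENERATE = {"ATG", "TGG", "TGA"}


def gc3(orf):
    """Fraction of third-position-synonymous codons ending in G or C."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    informative = [c for c in codons if c not in NON_DEGENERATE]
    if not informative:
        raise ValueError("ORF has no codons with synonymous third positions")
    return sum(c[2] in "GC" for c in informative) / len(informative)


def expected_enc(gc3_value):
    """Expected effective number of codons if GC3 were the only source of bias."""
    s = gc3_value
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


if __name__ == "__main__":
    toy_orf = "ATGAAAAAGGATGACTGA"  # made-up sequence, not a real gene
    s = gc3(toy_orf)
    print(f"GC3 = {s:.2f}, expected ENc = {expected_enc(s):.1f}")
```

An ORF whose observed ENc falls well below `expected_enc(gc3(orf))` shows more codon bias than its base composition alone would explain, which is the property that distinguishes the red points in the figure.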

The use of preferred codons by a gene is thought to allow for more efficient or accurate translation [65]. This appears to be the case here as, of the predicted genes using the preferred subset, many have likely housekeeping functions, and, although they account for only 0.85% of all predicted genes, 12.5% of all ESTs map to them (Table S7). Although some do not have EST matches and theoretically could represent falsely predicted genes, it seems unlikely that spurious genes would use the preferred codon set. Thus we predict that these outlier genes are either highly expressed (in at least some of the conditions normally encountered by the organism) or have some critical function requiring accurate translation.

Codon usage differences between genes are thought to have only small fitness effects. For natural selection to effectively work on codon usage differences and to thus create a preferred subset, factors that enhance genetic drift (e.g., small population sizes, inbreeding) must be weaker than the selective forces [66]. Thus although codon usage is probably under selective pressure in all species, not all are able to evolve preferred codon sets. For example, although it has a similar AT bias to T. thermophila, no preferred set could be detected in the apicomplexan Plasmodium falciparum (Figure 5C), possibly a reflection of its parasitic lifestyle and limited effective population size. The presence of a preferred subset in T. thermophila is likely a reflection of a large effective population size due to its free-living, sexually reproducing lifestyle (see [66,67] for additional discussion on the large population size of this species).

No plastid-derived genes can be identified. One question of particular interest that the T. thermophila genome might shed light on relates to the timing of the origin of the plastids found in apicomplexans and dinoflagellates, the other members of the alveolates [68,69]. Although the plastids in these lineages differ (e.g., that in apicomplexans, known as an apicoplast, is not even involved in photosynthesis), both are thought to be of red algal origin [70]. This has led to the proposal that the plastids in these lineages are the result of a single endosymbiotic event between an ancestor of apicomplexans and dinoflagellates and a red alga, with the algal nucleus being lost and the algal plastid being kept. A key question is whether this secondary endosymbiosis occurred before or after the ciliates split off from the other two lineages. The possibility that it occurred before the ciliate split is known as the chromalveolate hypothesis [71].

For the chromalveolate hypothesis to be correct, plastid loss would have to have occurred in ciliates, most likely at the base of the ciliate tree since no modern ciliates are known to harbor plastids. If the ancestor of ciliates once had a plastid, it is possible that some plastid-derived genes would have been transferred to the nuclear genome (as has occurred in many lineages including apicomplexans and dinoflagellates [72]), and furthermore that some such genes would still be found in T. thermophila. To test this possibility, we built phylogenetic trees of all genes in the genome and searched for those with a branching pattern consistent with plastid descent (see Materials and Methods). For T. thermophila, we do not see any signal for genes of plastid descent that rises above the noise seen in such automated phylogenetic analyses.

Several lines of evidence suggest that this is not a general flaw in the phylogenetic approach used here. For example, we have used the same approach to identify and catalog the plastid-derived genes in other lineages including the plant Arabidopsis thaliana and the apicomplexan P. falciparum. In addition, such an approach has been used to detect past endosymbioses in other eukaryotic lineages [73]. Finally, using the same approach we identified 91 likely mitochondrion-derived genes (Table S8) in the T. thermophila nuclear genome. This is significant because mitochondrion-derived genes are generally more difficult to identify than plastid-derived genes [74], in part because the plastid symbiosis was more recent [75].

Nevertheless, since it is possible that our phylogenomic screen might have missed some plastid-derived genes, we also did a targeted search for genes that might be expected to be retained, using the apicoplast as a model. Apicoplasts are involved in biosynthesis of fatty acids, isoprenoids, and heme. Fatty acid and isoprenoid biosynthetic pathways are of special interest because the plastid-derived pathways are distinct from analogous pathways in the eukaryotic cytoplasm [76]. In the case of isoprenoid biosynthesis, genes for proteins in the canonical eukaryotic cytosolic mevalonate pathway are present as expected based on experimental studies [77–79], but no enzymes involved in the plastid-derived DOXP pathway were evident. For fatty acid biosynthesis, while T. thermophila does not require an exogenous supply of fatty acids for growth, no evidence for a complete version of a type I (normally cytosolic) pathway could be found. Although at least some genes for a type II pathway are present, these are insufficient for de novo fatty acid synthesis and appear more likely to be derived from the mitochondrion than a plastid.

Based on the general and targeted searches, we conclude that there is presently no evidence for a plastid or ancestrally plastid-derived genes in T. thermophila. This does not preclude the possibility that other ciliates have plastid-derived enzymes or even a plastid, but there is presently no evidence to suggest this despite extensive ultrastructural observations [80,81]. If ciliates do lack all evidence of a plastid, it could either mean that the hypothesized early origin of the chromalveolate plastid is incorrect or that an ancestor of T. thermophila (and perhaps all ciliates) lost its plastid and all detectable plastid-derived genes outright. The latter possibility is not without precedent, as some apicomplexans such as the Cryptosporidia have lost their apicoplasts and have few, if any, plastid-derived genes in their nuclear genomes [82,83]. This loss has been suggested to be the result of metabolic streamlining in response to its parasitic lifestyle. Resolving whether a plastid was present in the ancestor of ciliates will be important to our understanding of the evolution of plastids and their biochemical relationship with eukaryotic hosts.

IES excision targets foreign DNA rather than repetitive DNA per se. As discussed in the Introduction, there are multiple parallels between the IES excision process and other repeat element silencing phenomena such as RIP and heterochromatin formation. Despite these parallels, the processes differ significantly in their mechanisms of action and therefore likely have different short- and long-term evolutionary consequences. For example, in species with RIP, all repetitive DNA becomes a target for mutational inactivation, which has resulted in a drastic suppression of evolutionary diversification through gene duplication [84,85]. The IES excision process results in the exclusion of certain MIC DNA sequences from the transcriptionally active MAC. Experimental introduction of foreign transgenes into the MIC has shown that as MIC copy number increases, so does the efficiency of transgene excision [86]. One might therefore predict a similar suppression of gene duplication as in RIP. However, rather than targeting repetitive DNA per se, it has been proposed that IES excision specifically targets foreign DNA that has invaded the germline MIC but is not represented in the MAC [35,87,88]. MIC gene duplication and functional diversification should still be possible under this scenario as long as, at each conjugation event, the gene copies have not diverged in sequence enough to be recognized as foreign and excluded from the MAC; since sex is frequent in natural populations of T. thermophila [89], this should be the case. We therefore sought to use the genome sequence data to both test the foreign DNA hypothesis and to examine what the consequences of the IES excision process have been on the evolution of the T. thermophila genome.

Analysis of the genome reveals several lines of evidence that provide strong support for the foreign DNA hypothesis. First, small but nevertheless significant amounts of repetitive DNA are present in the MAC. This is best seen in analysis of the scaffolds that correspond to complete MAC chromosomes which are unlikely to contain MIC IES contamination. These scaffolds contain dispersed repeats that make up 2.3% of the total DNA. This means that some repetitive DNA bypasses the IES excision process. The second line of evidence comes from examining the small contigs and singletons (nonassembled sequences) in the assembly data. Known MIC-specific elements such as the REP and Tlr1 transposons [90,91] are found only in these small contigs, which are thus clearly enriched for MIC-specific DNA (and also for repetitive DNA; see Figure S5). In fact, the small contigs contain homologs of an unusually wide range of transposable element (TE) clades for a single-celled eukaryote [92,93] including many previously unreported in Tetrahymena (Table S9). We do not find any good matches to TEs in any of the large contigs. Thus, transposons in general appear to be filtered out very efficiently by the IES excision process. The tandem and dispersed repeats in the MAC appear to correspond to noninvasive DNA (e.g., the 5S rRNA genes). Taken together, the fact that mobile (and likely invasive) DNA elements are kept out of the MAC, combined with the fact that both tandem and dispersed noninvasive repeats avoid the excision process, indicates strong support for the foreign DNA hypothesis.

In organisms with RIP, since all duplicated DNA is targeted [94], gene diversification by duplication is suppressed. For example, the fraction of all Neurospora crassa genes found in paralogous families is only 19%, a value that falls below the overall correlation line between this fraction and total gene number [84]. In addition, very few gene pairs share greater than 80% amino acid sequence identity [84]. Consistent with the foreign DNA hypothesis, we do not see such signs of suppression of gene family diversification in T. thermophila. Large numbers of paralogous genes are found in the genome (1,970 gene families including 10,851 predicted proteins) (Table 3). The fraction of genes in such families in T. thermophila (39%) is much higher than that seen in N. crassa. Although this fraction is not as high as would be predicted from the observed correlation between total number of genes and the fraction found in paralogous families [84], the fraction of gene pairs sharing greater than 80% amino acid identity is much higher than in N. crassa and similar to that found in other sequenced eukaryotes.

Table 3. Gene Families

Since it is possible some of the 1,970 gene families could have originated by duplications that occurred prior to the origin of the IES excision process, it is more useful to examine recent duplications. We searched for such duplications in multiple ways, including the identification of genes duplicated in the T. thermophila lineage relative to other lineages for which genomes are available (Table S10) and by searching for pairs of paralogs with very similar sequences. Both of these classes are abundant in T. thermophila, further indicating that the IES excision does not significantly affect expansion of gene families of “native” genes. Thus the ciliate system of targeting invading DNA has significantly different consequences than RIP.

High gene count in T. thermophila. The expansion of gene families helps explain the high gene count in T. thermophila, which is higher than that of other protists and even surpasses that of some metazoans (Table 4). The duplication events appear to be spread out over evolutionary time with some being ancient and some quite recent. We searched for but did not find evidence for either whole genome or segmental duplications. We do find extensive numbers of tandemly duplicated genes. In total, 1,603 tandem clusters of between two and 15 genes were found, comprising 4,276 total genes; 67% of these clusters are simple gene pairs and 96% contain five or fewer genes. Thus it appears many of the paralogous genes in T. thermophila are the results of separate small duplication events.
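
The text does not give the rule used to call a tandem cluster. One simple definition, assumed here purely for illustration, is a run of two or more strictly adjacent genes on a scaffold that belong to the same paralogous family. The sketch below applies that definition; the gene and family identifiers are hypothetical.

```python
from itertools import groupby


def tandem_clusters(genes):
    """Find runs of adjacent same-family genes along one scaffold.

    genes: list of (gene_id, family_id) tuples in chromosomal order;
           family_id is None for genes that belong to no paralogous family.
    Returns a list of clusters, each a list of gene_ids (length >= 2).
    """
    clusters = []
    # groupby merges consecutive entries sharing a family_id, which is
    # exactly the "strictly adjacent" criterion assumed here.
    for family, run in groupby(genes, key=lambda gene: gene[1]):
        members = [gene_id for gene_id, _ in run]
        if family is not None and len(members) >= 2:
            clusters.append(members)
    return clusters


if __name__ == "__main__":
    scaffold = [
        ("g1", "F1"), ("g2", "F1"),                # a simple gene pair
        ("g3", None),                              # singleton gene
        ("g4", "F2"), ("g5", "F2"), ("g6", "F2"),  # cluster of three
        ("g7", "F1"),                              # same family, not adjacent
    ]
    for cluster in tandem_clusters(scaffold):
        print(len(cluster), cluster)
```

A real analysis would probably tolerate a small number of intervening genes and would be run scaffold by scaffold; both choices would change the cluster counts.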

Table 4. Numbers of Protein-Coding Genes in Various Eukaryotes

The high gene count in T. thermophila relative to some other single-celled eukaryotes is not simply a reflection of gene family expansions. For example, when recent gene expansions are collapsed into ortholog sets, we find that humans and T. thermophila share more orthologs with each other (2,280) than are shared between humans and the yeast S. cerevisiae (2,097) or T. thermophila and P. falciparum (1,325) (Figure 6), despite the sister phyla relationships of animals and fungi on the one hand and ciliates and apicomplexans on the other. We note that this does not mean that humans and T. thermophila are overall more similar to each other than either is to species in sister phyla. For example, humans and S. cerevisiae do share some processes that evolved in the common ancestor of fungi and animals. In addition, for orthologs found in all eukaryotes, the human and S. cerevisiae genes are more similar in sequence to each other than either is to genes from T. thermophila. The higher number of orthologs shared between humans and T. thermophila is a reflection of both the loss of genes in other eukaryotic lineages and the retention of a variety of ancestral eukaryotic functions by T. thermophila. Consistent with this conclusion, there are 874 human genes with orthologs in T. thermophila but not S. cerevisiae, 58 of which correspond to loci associated with human diseases (Table S12). Thus genome analysis reveals many cases where T. thermophila can continue to complement experimental studies of yeast as a model system for eukaryotic (and human) cell biology [13].

Figure 6. Orthologs Shared among T. thermophila and Selected Eukaryotic Genomes

Venn diagram showing orthologs shared among human, the yeast S. cerevisiae, the apicomplexan P. falciparum, and T. thermophila. Lineage-specific gene duplications in each of the organisms were identified and treated as one single gene (or super-ortholog). Pairwise mutual best-hits by BLASTP were then identified as putative orthologs.
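
The mutual best-hit step described in this legend can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the BLASTP results have already been reduced to (query, subject, bit score) tuples, that lineage-specific duplicates have already been collapsed into single representatives, and that ties are broken by keeping the first hit seen. All identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    On equal scores the hit encountered first is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted(
        (a_id, b_id) for a_id, b_id in forward.items()
        if reverse.get(b_id) == a_id
    )


if __name__ == "__main__":
    # Toy data: h* stand for proteins from one genome, t* from the other.
    a_vs_b = [("h1", "t1", 250.0), ("h1", "t2", 90.0),
              ("h2", "t2", 120.0), ("h3", "t1", 200.0)]
    b_vs_a = [("t1", "h1", 245.0), ("t2", "h2", 118.0), ("t2", "h1", 80.0)]
    print(mutual_best_hits(a_vs_b, b_vs_a))
```

In the toy data, h3 finds t1 as its best hit, but t1 prefers h1, so h3 is not reported as a putative ortholog.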

Gene Duplication as an Indicator of Important Biological Processes

One motivation for obtaining the genome sequence of an organism is to advance the study of processes already under investigation. Many researchers, including those who have never worked on this species before, have taken advantage of the publicly available data in an effort to achieve this goal (e.g., [24,95–103]). Rather than focus our bioinformatic analysis on these well-studied processes, we decided to search for evidence in the predicted proteome of processes of particular importance to the organism. Our approach was relatively straightforward: we looked for overrepresentations (compared to other eukaryotes) in the lists of paralogous gene families or lineage-specific gene family expansions associated with a variety of processes. This approach was taken for several reasons. First, searches for differences in large gene families are not as biased by annotation errors as searches focused on individual genes. In addition, large gene families clearly contribute to the large number of genes present in T. thermophila compared to other single-celled eukaryotes. We note that many of the available genomes of single-celled eukaryotes are of parasites that were selected for sequencing mostly due to their medical relevance and that these are not representative (e.g., many have quite small genomes). Most important, the presence of large gene families and recent gene duplications are likely indications of functional diversity, recent evolutionary innovations, and selective pressures placed on this organism.

Our analysis of paralogous gene families and in particular the recently duplicated members of such families reveals the importance of processes associated with the sensing of and responding to environmental changes. We highlight five such processes here: signal transduction, membrane transport, proteolytic digestion, construction and manipulation of cell shape and movement, and membrane trafficking. These processes are all critical to the free-living heterotrophic lifestyle of this organism. In the following sections, we discuss what the analysis of the genome reveals about these processes in T. thermophila with a particular focus on expansions of genes associated with these functions relative to other species.

Signal transduction and the expansions of kinase families. A variety of genes with putative roles in signal transduction were identified in our screens of paralogous genes. Of these, we chose to perform an in-depth analysis of the kinases because they are such a diverse family of proteins and because they have been found to have critical roles in sensory and regulatory processes across the tree of life. In total, 1,069 predicted protein kinases (Tables 5 and S11A) were identified in the genome. This corresponds to approximately 3.8% of the predicted proteome, a fraction significantly larger than the approximately 2.3% in fungi, Drosophila, and vertebrates [104]. Among these, representatives were found of 54 of the known kinase families and subfamilies [105]. Some families found in a wide diversity of eukaryotes [106] were not detected. This includes the checkpoint kinase CHK1/RAD53, the PI3 kinase–related kinase TRRAP, two cyclin-dependent kinases (CDK7 and CDK8, which may be functionally replaced by the related expanded CDC2 family), and two poorly conserved classes (Bub1 and Haspin) that may have been missed by sequence homology searches. Despite the reported presence of phosphotyrosine in T. thermophila [107], no clear members of the tyrosine kinase group could be identified. However, the genome encodes some proteins that might be alternative tyrosine kinases including multiple dual-specificity kinases (e.g., Wee1, Ste7, TTK, and Dyrk) as well as five members of the related TKL group, which may mediate tyrosine phosphorylation in the slime mold Dictyostelium discoideum [106]. Twelve kinase classes are found in T. thermophila and humans but not yeast, and thus are apparent examples of the retention of ancestral eukaryotic functions discussed above. Several of the genes in these classes have been implicated in the etiology of human disease (Dyrk1A, DNAPK, SGK1, RSK2, Wnk1, and Wnk4) [108].

Table 5. Distribution of Selected Protein Kinase Classes in T. thermophila and Other Classified Kinomes

A key feature of the T. thermophila kinome is the expansion of several kinase classes relative to other sequenced organisms (Table 5). The implications of some of these expansions can be predicted based on the known functions of family members. For example, the mitotic kinase families Aurora, CDC2, and PLK are all substantially expanded, perhaps reflecting the additional signaling complexity required by two nuclei that simultaneously engage in very different processes within the same cell cytoplasm. Also expanded are multiple kinases that interact with the microtubule network [109,110] [e.g., Nima-related kinases (NRKs) and the ULK family], possibly reflecting diversification of cytoskeletal systems (discussed more below). Of the kinase families with known functions, the most striking expansion is the presence of 83 histidine protein kinases (HPKs), which are generally involved in transducing signals from the external environment [111]. HPKs are found predominantly in two-component regulatory systems of bacteria, archaea, protists, and plants and are absent from metazoans. Most of the T. thermophila HPKs have substrate receiver domains, and many are predicted to be transmembrane receptors.

The full meaning of the kinome diversity in T. thermophila is hard to predict as a great deal of the diversification has occurred in classes for which the functions are poorly understood. For example, in many of the known kinase families, the T. thermophila proteins are highly diverse in sequence, both relative to those in other species as well as to each other (e.g., see Figure S6). The scope of the diversification in T. thermophila is perhaps best seen in the fact that 630 (approximately 60%) of the kinases could not be assigned to any known family or subfamily [105]. Overall, 37 novel classes of kinases and hundreds of unique proteins were identified in this genome. The presence of so many novel kinases and expansions in many known classes of kinases is both an indication of the versatility of the eukaryotic protein kinase domain seen in other lineages [112] and suggestive of a great elaboration of ciliate-specific functions.
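The kinase figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the proteome size it prints is back-calculated from 1,069 kinases at roughly 3.8% and is not a number stated in this section, and the variable names are ours.

```python
# Figures quoted in the kinase paragraphs above.
total_kinases = 1069         # predicted protein kinases
kinase_fraction = 0.038      # ~3.8% of the predicted proteome
unassigned_kinases = 630     # not assignable to a known family or subfamily
reference_fraction = 0.023   # ~2.3% in fungi, Drosophila, and vertebrates

# Back-calculated (an inference, not a stated figure): proteome size implied
# by 1,069 kinases making up ~3.8% of all predicted proteins.
implied_proteome = total_kinases / kinase_fraction

# Share of kinases with no known family (the text says "approximately 60%").
unassigned_share = unassigned_kinases / total_kinases

# Kinase count the same proteome would hold at the ~2.3% reference fraction.
expected_at_reference = implied_proteome * reference_fraction

print(f"implied proteome size: ~{implied_proteome:,.0f} proteins")
print(f"unassigned share:      {unassigned_share:.1%}")
print(f"kinases at 2.3%:       ~{expected_at_reference:,.0f}")
```

Because the percentages in the text are rounded, the derived values should be read as approximate.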

Diversification of membrane transport systems. Many of the most greatly expanded T. thermophila gene families encode proteins predicted to be involved in membrane transport. Membrane transporters play critical roles in responding to variations in the environment and making use of available resources. We therefore conducted a more thorough analysis of the predicted transporters in this species. Overall, T. thermophila possesses a robust and diverse collection of predicted membrane transport systems (Tables 6 and S11B). Comparison to other eukaryotes [113] reveals some interesting differences in terms of both classes of transporters and predicted substrates being moved. For example, T. thermophila has more representatives in each of the four major families than do humans. In addition, it encodes a much higher number of transporters in the ABC superfamily, voltage-gated ion channels (VICs), and P-type ATPases than any other sequenced eukaryotic species (Table 6) including the other free-living protists, the diatom Thalassiosira pseudonana, and the slime mold D. discoideum. Regarding substrates, an extremely extensive set of transporters likely specific for inorganic cations has been identified (Table 6). Most of these are channel-type transporters and cation-transporting P-type ATPases. Interestingly, despite the apparent massive amplification of cation transporters, T. thermophila has a very limited repertoire of transporters for inorganic anions: only one member each for sulfate, phosphate, arsenite, and chromate ion were identified, and there are no predicted anion channels. The reason for the difference in the amplification of cation versus anion transporters is unclear.

Table 6.

Comparison of the Numbers of Membrane Transporters in T. thermophila and Other Eukaryotes by Family and Predicted Substrate

As with kinases, some of the most interesting properties are revealed by examination of the lineage-specific duplications of transporters. The recent clusters include K+ channel proteins (285 members), ABC transporters (152 members), cation-transporting ATPases (59 members), K+ channel beta subunit proteins (22 members), oxalate:formate antiporters (24 members), sugar transporters (22 members), and phospholipid-transporting ATPases (20 members). The expansion of the K+ channel proteins, which are VIC-type transporters, was particularly large and was pursued further.

In total, 308 VIC-type K+-selective channels have been predicted, many more than in any other sequenced species and over three times as many as identified in humans (89). A multigene family of potassium ion channels has also been identified in P. tetraurelia [114] and thus may be a general characteristic of some ciliates. Some lines of evidence suggest that this expansion in ciliates could be adaptive. First, K+ channels control the passive permeation of K+ across the membrane, which is essential for ciliary motility [115]. Second, a novel adenylyl cyclase with a putative N-terminal K+ ion channel regulates the formation of the universal second messenger cAMP in ciliates and apicomplexans [116,117], which could assist in responding to sudden changes of the ionic environment. T. thermophila encodes six homologs of this adenylate cyclase/K+ transporter, whereas the parasitic apicomplexans P. falciparum and Cryptosporidium parvum encode only one each.

The robust transporter systems present are likely a reflection of T. thermophila’s behavioral and physiological versatility as a free-living single-celled organism and its exposure to a wide range of different substrates in its natural environment. Examination of the specific types of expansions suggests that functions associated with transport of K+ and other cations have been greatly diversified. Thus such functions may play a role in many of the unique aspects of the biology of this species and ciliates in general.

Proteolytic processing. T. thermophila is a voracious predator and thus might be expected to have a wide diversity of proteolytic enzymes. Analysis of the predicted proteins in T. thermophila reveals some conflicting results relating to this idea. On the one hand, many of the largest clusters of lineage-specific duplications are of proteases (e.g., papain, leishmanolysin). On the other hand, the total number of proteases identified (480) is relatively low in terms of the fraction of the proteome (1.7%) compared to other model organisms that have been sequenced and annotated [118–120]. The conflict is most likely a reflection of the diversity of physiological processes in which proteases function [121]. Thus we examined the subclassification of types of proteases present in more detail.

Using the Merops protease nomenclature, which is based on intrinsic evolutionary and structural relationships [119], the T. thermophila proteases were divided into five catalytic classes and 40 families. These are: 43 aspartic proteases belonging to two families, 211 cysteine proteases belonging to 11 families, 139 metalloproteases belonging to 14 families, 73 serine proteases belonging to 12 families, and 14 threonine proteases belonging to the T1 family (Tables 7 and S11C). Some unique features of T. thermophila can be seen by comparison to P. falciparum, which is the most closely related sequenced species to have a detailed analysis of its proteases published [122]. Twenty-one protease families are present in both genomes. For example, the highly conserved threonine proteases and the ubiquitin carboxyl-terminal hydrolase families (C12 and C19) reflect the crucial role of the ATP-dependent ubiquitin-proteasome system, which has been implicated in cell-cycle control and stress response [123]. Nineteen protease families are present in T. thermophila but not P. falciparum. One of these includes leishmanolysin (M8), originally identified in the kinetoplastid parasite Leishmania major and thought to be involved in processing surface proteins [124–126]. This family is greatly expanded (to 48 members, including 15 in a tandem array) in T. thermophila and suggests that surface protein processes may be important here, although the functions of leishmanolysin-related proteases in nonkinetoplastid eukaryotes remain unclear. The carboxypeptidase A (M14) and carboxypeptidase Y (S10) families are expanded to 28 and 25 members, respectively, in T. thermophila, which may reflect numerous and diverse functions. Only four protease families present in P. falciparum are not found in T. thermophila. Among these is metacaspase (C14), an ancestral type of caspase that is characteristic of apoptosis or apoptosis-like signal transduction pathways [127].

Table 7.

Protease Complements in T. thermophila and Other Model Organisms

The largest clusters of expanded proteases in T. thermophila are all cysteine proteases, which comprise 44% of the total protease complement. The two most prominent families from this class are the papain family (C1), which is the most abundant and complex family, with 114 members, and the ubiquitin carboxyl-terminal hydrolase 2 family (UCH2, C19) with 47 members. It is possible that the biochemical activity among the paralogs within these families is conserved but that they are used in different parts of the cell (or outside the cell) or in different developmental stages in T. thermophila.
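As a consistency check on the protease counts quoted in the paragraphs above, the per-class numbers can be summed and compared with the stated totals (480 proteases, 40 families, 44% cysteine proteases). This is a minimal sketch using only figures from the text; the dictionary layout and names are illustrative.

```python
# (members, families) per catalytic class, as quoted in the text.
protease_classes = {
    "aspartic": (43, 2),
    "cysteine": (211, 11),
    "metallo": (139, 14),
    "serine": (73, 12),
    "threonine": (14, 1),
}

# Totals across all five catalytic classes.
total_members = sum(members for members, _ in protease_classes.values())
total_families = sum(families for _, families in protease_classes.values())

# Fraction of the protease complement made up of cysteine proteases.
cysteine_share = protease_classes["cysteine"][0] / total_members

# Family comparison with P. falciparum, as quoted in the text.
shared_with_p_falciparum = 21
t_thermophila_only = 19

print(f"total proteases: {total_members}")
print(f"total families:  {total_families}")
print(f"cysteine share:  {cysteine_share:.0%}")
print(f"shared + T. thermophila-only families: "
      f"{shared_with_p_falciparum + t_thermophila_only}")
```

The class counts sum to the stated totals, and the shared plus T. thermophila-only families account for all 40 families.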

Cytoskeletal components and regulators. Ciliates have highly complex cytoskeletal architecture [128] with highly polarized cell types which assemble 18 types of microtubular organelles in specific locations along the anteroposterior and dorsoventral axes. We therefore sought to determine whether this diversity was reflected in the genome. As with the protease analysis described above, initial comparisons of the numbers of particular types of cytoskeletal and microtubule-associated proteins were somewhat ambiguous (the numbers for humans and T. thermophila are shown in Tables 8 and S11D). For example, although kinesin and dynein motors as well as kinases associated with microtubules appear to be expanded, structural components of the cilia and participants in the intraflagellar transport pathway are not. In addition, some cytoskeletal protein types are apparently absent from T. thermophila; these include intermediate filament proteins (including nuclear lamins) as already suggested by biochemical studies [129], some microtubule-associated proteins (MAP2, MAP4, and Tau, for which no nonanimal eukaryotic homologs have been found) and some actin-binding proteins (e.g., α-actinin). To better understand what role genes involved in microtubule and cytoskeletal functions might have played in the diversification of this species, we focused analysis on some of the genes with apparent expansions: tubulins, dyneins, and regulatory proteins.

Table 8.

Numbers of Loci Encoding Selected Types of Cytoskeletal Genes in T. thermophila and H. sapiens

Tubulins. Tubulins are the key structural components of microtubules and they come in many forms in eukaryotes [130]. In the T. thermophila genome, phylogenetic analysis of tubulin homologs (Figure 7) reveals the presence of one or two genes, each within the essential alpha (α), beta (β), and gamma (γ) subfamilies (as reported previously [131–133]) and one in each of the delta (δ), epsilon (ɛ), and eta (η), which are found in organisms that possess centrioles/basal bodies [134–136]. In addition, T. thermophila encodes noncanonical tubulin homologs that can be divided into two categories. In the first category are genes that are most similar to the canonical α- or β-tubulins. These nine genes (three α-like and six β-like) lack characteristic motifs for the tail domain post-translational modifications (polyglutamylation and polyglycylation) that are essential to the function of their canonical counterparts [137–139]. Three of the β-like genes (BLT1/TTHERM_01104960, TTHERM_01104970, and TTHERM_01104980) form a tandem cluster with intergenic intervals of less than 2 kb. We hypothesize that these genes function, perhaps redundantly, in formation or function of some of the many highly specialized microtubule systems of T. thermophila cells. Experimental analysis of BLT1, a β-like tubulin, indicated that its product localizes to a small subset of microtubules and is not incorporated into growing ciliary axonemes (K. Clark and M. Gorovsky, unpublished data). Genetic deletion of this gene or of the α-like gene TTHERM_00647130 did not yield an obvious phenotype (R. Xie and M. A. Gorovsky, unpublished data).

Figure 7. Tubulin Gene Diversity in T. thermophila

The figure shows a neighbor-joining tree built from a clustalX alignment. Species abbreviations: Hs, H. sapiens; Dm, D. melanogaster; Sc, S. cerevisiae; Tt, T. thermophila; Pt, P. tetraurelia; Cr, C. reinhardtii; Tb, T. brucei; Ec, E. coli; Xl, Xenopus laevis. A prokaryotic tubulin ortholog, Escherichia coli FtsZ, was used as the outgroup.

The second category of noncanonical tubulin homologs consists of three novel proteins (TTHERM_00550910, TTHERM_01001250, and TTHERM_01001260) that fall into a clade with P. tetraurelia iota tubulin. Two of these (TTHERM_01001250 and TTHERM_01001260) are closely related to each other (Figure 7) and closely linked in the genome and thus likely arose by a recent tandem duplication. The functions of these genes are unknown, but because they are, so far, unique to ciliates, they might be responsible for microtubule functions specific to this phylum.

Dyneins. Dyneins, which were first discovered in Tetrahymena [140], are molecular motors that translocate along microtubule tracks, a process critical to many activities in T. thermophila including ciliary beating, karyokinesis, MAC division, cortical organization, and phagocytosis. Many of these activities are critical for sensing and responding to changes in the environment. Each dynein complex consists of one, two, or three heavy chains (containing the motor activity) and specific combinations of smaller subunits, including intermediate, light-intermediate, and light chains, which regulate motor activity and the tethering of dynein to its molecular cargo [141–143]. In organisms with cilia or flagella, there are multiple isoforms of dyneins, including the axonemal outer arm dyneins, the axonemal inner arm dyneins, and nonaxonemal or “cytoplasmic” dyneins. Each is specialized in its intracellular location and the cellular task it performs [144].

In total we identified 21 light chains, five intermediate chains, two light-intermediate chains, and 25 heavy chains (Table S13). The expression of each gene, as well as the exon/intron structures of most, was confirmed by RT-PCR and, if necessary, sequencing of the RT-PCR product. For the most part, the families of T. thermophila dynein subunits appear to be similar to those of other model organisms; however, there are some interesting differences. T. thermophila light chains LC3A and 3B are most similar to the green alga Chlamydomonas reinhardtii’s LC3 and LC5 [145]. These proteins belong to the larger family of thioredoxin-related proteins, and, without biochemical evidence identifying one or both of the proteins as part of a dynein complex, it may be premature to label these as dynein components. Light chain LC4 belongs to the calmodulin-related family of proteins and may regulate calcium-dependent ciliary reversal. T. thermophila expresses two LC4 genes, perhaps providing alternative or additional ways to control ciliary motility compared to species that express only one. In other systems, LC8 is associated with several different dynein and nondynein complexes, and T. thermophila expresses one canonical LC8 as well as five divergent LC8-like genes, with unknown functions.

Perhaps the most interesting revelation is that T. thermophila expresses 25 dynein heavy chains. These include the 14 DYH genes previously described [146,147] and 11 new ones, all of which appear to be axonemal. The complexity of the DYH family may represent a mechanism by which the organism can fine-tune ciliary activity, produce specialized cilia (e.g., oral and posterior cilia), and/or generate large numbers of new cilia quickly. Along these lines, there has also been an expansion in other motor proteins. For example, there are 78 kinesins, more than in any other sequenced organism ([101] and Table 8). In addition, although there are fewer myosins than in humans (13 versus 22), 12 of 13 of the T. thermophila genes comprise a single novel myosin class not found in other organisms [102,148].

Regulation of microtubules and microtubule-associated processes. Among the expanded genes in T. thermophila are a variety implicated in the regulation of microtubules or microtubule-associated processes. One example is the tubulin tyrosine ligase-like domain proteins of which multiple members have been identified as enzymes responsible for polyglutamylation of either α- or β-tubulin [149]. T. thermophila encodes 50 tubulin tyrosine ligase-like proteins compared with 14 in human. Another example is the NRK family of protein kinases which, as mentioned above, has undergone a large expansion in T. thermophila. NRKs are often found associated with microtubular organelles [150] such as centrioles, basal bodies, and flagella and play multiple roles, including the regulation of centrosome maturation [151] and flagellar excision [152]. We identified 39 NRKs in T. thermophila, roughly three times the number of such loci in humans. Phylogenetic and functional analyses have suggested that this diversification has adapted the members of this family for distinct subcellular localizations and cytoskeletal roles [103]. Thus, such gene expansions could allow differentially targeted protein isoforms to regulate the function of the same organelle type in different locations or generate different properties of the same structural building materials (e.g., microtubules), which are used as frameworks to build different types of organelles.

Secretory pathways and membrane trafficking. Besides the conventional organelles, T. thermophila maintains several more specialized membrane-bound compartments, including alveoli (shared with other alveolates), a contractile vacuole (found in many protists), and separate, functionally distinct macronuclei and micronuclei [128]. It also has multiple pathways for plasma membrane internalization, as well as both constitutive and regulated exocytosis [128,153]. The sorting and trafficking of membrane components are critical functions for all these activities. Analysis of the genome reveals homologs of many of the key proteins known from other eukaryotes to be involved in vesicle formation and fusion, including all major classes of coat proteins (Table S14). One interesting finding that came from genome analysis is that T. thermophila encodes eight dynamin-related proteins, more than most other sequenced unicellular eukaryotes, and two of them, Drp1p and Drp2p, have evolved a new function in endocytosis [96] (A. Rahaman and A. P. Turkewitz, unpublished data). Furthermore, phylogenetic analysis indicated that the recruitment of dynamin to a role in endocytosis occurred independently by convergent evolution in the animal and ciliate lineages [96].

The diversification of membrane trafficking is more apparent in regard to Rab proteins, which are small monomeric GTPases that regulate membrane fusion and fission events. T. thermophila, with 69 Rabs (Table S15), has a number more along the lines of humans (which have 60) than many single-celled species, such as Saccharomyces cerevisiae, which has 11 [154] and Trypanosoma brucei, which has 16 [155]. Based on localization and functional studies, including comparisons between yeast and humans [156], Rabs have been divided into eight groups [157]. Phylogenetic analysis (Figure S7) indicates that T. thermophila encodes representatives of all but groups IV and VII, which are involved in late endocytosis and Golgi transport, respectively. For group VII this appears to reflect a lineage-specific loss, since the genomes of both T. brucei and Entamoeba histolytica have several homologs in this group. Two T. thermophila Rabs appear homologous to Rab28 and Rab32, which have not been assigned to any of these groups; Rab32 was previously thought to be restricted to mammalian lineages. Rab groups II and V, involved in endocytosis, are especially large in T. thermophila and include several Rab2, Rab4, and Rab11 homologs in group II. This may reflect the intricacy of maintaining at least two major pathways of membrane internalization. Additionally, 29 Rabs in T. thermophila fail to cluster with any of the Rab groups found more widely among eukaryotes. Within this group, 20 cluster into three clades, designated Tetrahymena clades I, II, and III in Figure S7, which may represent ciliate-specific radiations. The remaining nine are very divergent and may represent very ancient duplication events and/or changes related to recruitment for novel function. Because unambiguous alignment among such divergent Rabs is difficult, their relationships will become clearer as additional related genomes are sequenced.

Recently, large numbers of Rabs have been found in a variety of amoeboid protists including D. discoideum, E. histolytica [158], and the parabasalid Trichomonas vaginalis [159]. The diversification in these species was proposed to relate to their amoeboid lifestyle [159]. However, the presence of significant diversification in T. thermophila suggests that different protist lifestyles may be accompanied by their own brand of significant Rab diversification.

Tetrahymena Genome Database

An integral part of the effort to make the genomic resources and analyses described above widely available to researchers working with T. thermophila and other organisms has been the creation of the Tetrahymena Genome Database (TGD; http://www.ciliate.org), a Web-accessible resource on the genetics and genomics of T. thermophila. TGD provides information about the T. thermophila MAC genome, its genes and gene products, facts about the ciliate scientific community, and tools for querying the genome and collected scientific literature. TGD was created using the database environment developed for the Saccharomyces Genome Database and software tools contributed to the Generic Model Organism Database (GMOD) project.

Information from the published literature on T. thermophila is distilled in multiple ways. Results from published studies of T. thermophila genes are curated and provided, including community-approved gene names, other nonstandard aliases, nucleotide and amino acid sequences, and literature citations. In addition, free-text descriptions are associated with predicted gene models, and full-text searching is provided using Textpresso [160]. To enable intra- and cross-species comparisons, when information on characterized genes is curated, TGD staff members capture aspects of a gene product’s biology (i.e., molecular function, biological role, and cellular localization) using terms from the Gene Ontology (http://www.geneontology.org). This is complemented by automated functional annotation of all predicted genes. Other resources include tools for searching the annotation by keywords, similarity searching using BLAST and BLAT, Gbrowse-based genome visualization [161], information about Tetrahymena research laboratories, links to other ciliate-related resources, and various tutorials. The TGD staff is always available to help individual researchers by answering questions, finding information, and generating datasets specific to their needs.

Conclusions and Future Plans

In sequencing and assembling the T. thermophila MAC genome, there were many anticipated major challenges not commonly seen in eukaryotic genome projects. Overall, however, the assemblies are remarkably accurate and represent excellent coverage of the genome. This is likely in large part due to low levels of repetitive DNA, one of the features of the MAC genome that initially led us to select it for sequencing. The sequence data in our current assemblies are certainly complete enough for detailed analyses of the predicted biology of this species as we have reported here and others have shown. In addition, the genome sequence is already being used in many functional genomic studies taking advantage of the powerful experimental tools available. Along these lines, it will be of great value to do comparative analyses with the genome sequences of other ciliates such as P. tetraurelia and Oxytricha trifallax, which are in progress.

One of our main goals is to obtain a complete sequence of the MAC genome, and there are still some challenges left to its achievement. Since we were unable to obtain quality sequence data from large insert clones, any region of the MAC genome containing significant amounts of repetitive DNA would not have assembled well. To overcome this pitfall we are now using HAPPY mapping [162] as an alternative approach to obtaining such linking information. Also, it is known that at least the ends of at least two MAC chromosomes present immediately following conjugation disappear during subsequent vegetative growth, perhaps an indication that these chromosomes are incapable of long-term maintenance [41]. As expected, we do not find sequences corresponding to these ends in our database. Thus alternative methods will be required to obtain the sequences of these regions and any others lost during early vegetative growth. Despite these challenges, all the evidence suggests that it will be possible to close the entire MAC genome.

Of course, the entire MAC genome alone does not provide us with a complete picture of the T. thermophila genome. Sequencing the MIC genome will be more challenging due to the greater abundance of repetitive DNA. However, we will be able to use the MAC genome as a scaffold and thus in a way MIC sequencing will be equivalent to genome closure rather than an independent project. We have already begun in this area by determining the sequence adjacent to MIC Cbs junctions and mapping these to MAC assemblies as well as the reverse—using MAC telomere-adjacent sequences to pull out MIC Cbs-flanking regions [24,41].

Having a MIC sequence and mapping the MIC to the MAC will be useful in understanding many aspects of T. thermophila biology that we cannot study through the MAC. These include centromere function, MIC telomere features, and the extent to which the MAC and MIC in T. thermophila and other ciliates are the equivalent of somatic and germ cells. Perhaps most important, having both genomes will allow detailed analyses of the genome-wide DNA rearrangement process. It is only by having both genome sequences that we can fully understand the biology of this fascinating species.

Materials and Methods

Cell growth, DNA isolation, and library construction.

T. thermophila cell lines currently in laboratory use were first isolated from the wild in the 1950s [163] and were maintained by serial passage and inbreeding for over 16 y before viable freezing methods were developed. Strain SB210 [164] is the end result of about 25 sexual reorganizations in laboratory culture, including a series of sexual inbreedings by the equivalent of brother-sister matings giving rise to the inbred strain B genetic background [165]. Following the final conjugation, a thoroughly assorted cell line was isolated after at least three serial single-cell isolations (SCIs). The last SCI was approximately 150 fissions after conjugation. These serial SCIs provided abundant opportunity to isolate a cell line that had become pure for most of the MAC developmental diversity but not necessarily all because assortment brings about a stochastic, exponential decay in diversity. The chosen cell line was then subjected to a genomic exclusion cross [166], which generates a whole-genome homozygous MIC but does not generate a new MAC. At least one additional SCI occurred at this step, after which this cell line was frozen. As needed, frozen stocks were replenished following a minimal number of vegetative fissions. The strain has been deposited in the Tetrahymena Stock Center at Cornell University as suggested [167].

A culture was started from a fresh thaw of strain SB210. Purified macronuclei were prepared by differential sedimentation, and DNA was extracted from the purified macronuclei as described [168]. The preparation was checked by Southern blot hybridization to verify that the level of contamination with MIC DNA was low. Genomic libraries were prepared as described [169]. DNA was randomly sheared, end-polished with consecutive polynucleotide kinase and T4 DNA polymerase treatments, and size-selected by electrophoresis in 1% low-melting-point agarose. After ligation to BstXI adapters (Invitrogen, Carlsbad, California, United States; catalog No. N408–18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3′-CACA overhangs, were inserted into BstXI-linearized plasmid vector (pHOS2, a medium-copy pBR322 derivative) with 3′-TGTG overhangs. Libraries were constructed with the following average insert sizes: 1.8, 2.5, 3.5, 5.0, and 8.5 kb (Table S1). Libraries with larger insert sizes were unstable, presumably due to the high AT content in the genomic DNA.

Sequencing was done from paired-ends primarily at the J. Craig Venter Science Foundation Joint Technology Center. Possible contaminating sequences from other projects have been filtered out using BLASTN searches against all other genome projects conducted at the same time at TIGR and the Joint Technology Center. Whole genome assemblies were performed using the Celera Assembler [37] with modifications implemented by researchers at the J. Craig Venter Science Foundation and TIGR. Sequence reads corresponding to the mitochondrial and rDNA chromosomes were identified using the latest version of the MUMmer program [170] and comparison to the published sequences.

Linking open ends of assembled scaffolds to telomeres.

The initial assembly contained 85 telomere-capped scaffold ends. However, these ends correspond to a minority of the total number of non–rDNA telomere–containing sequence reads, which we estimate to be 4,058. Computational and experimental methods were used to identify and confirm scaffold ends that were very close to a telomere, marking the end of a chromosome.

One method matched read-mates of telomere-containing reads (Tel-reads) that the assembly program failed to incorporate into scaffolds. These were identified by searching the sequence read database for exact matches to a 12-mer encompassing two telomeric repeats (GGGGTTGGGGTT). Read-mates were identified for 95% of the Tel-reads. Two internal 40-nt tags were extracted from each Tel-read mate and tested for at least one exact match with the terminal 5 kb of every scaffold (or the entire scaffold if less than 10 kb long). After clustering the matches, a nonredundant list of Tel-linked scaffold ends was generated.
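The read-mate matching procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the pipeline actually used. The text does not say where the two 40-nt tags were taken from within each mate or whether both strands were searched, so the tag positions (about one-third and two-thirds along the mate) and the both-strand matching are assumptions, as are all function names and the toy data.

```python
import random

TELOMERE_12MER = "GGGGTTGGGGTT"   # two telomeric repeats, from the text
TAG_LEN = 40                       # tag length, from the text
END_WINDOW = 5_000                 # terminal 5 kb of each scaffold
SHORT_SCAFFOLD = 10_000            # below this, search the entire scaffold

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def is_tel_read(read):
    """Exact 12-mer match; checking both strands is an assumption."""
    return TELOMERE_12MER in read or revcomp(TELOMERE_12MER) in read

def internal_tags(mate, tag_len=TAG_LEN):
    """Two internal tags; their placement is assumed, not from the text."""
    if len(mate) < tag_len:
        return []
    last_start = len(mate) - tag_len
    starts = sorted({last_start // 3, (2 * last_start) // 3})
    return [mate[s:s + tag_len] for s in starts]

def scaffold_end_windows(seq):
    """Terminal windows to search, or the whole scaffold if it is short."""
    if len(seq) < SHORT_SCAFFOLD:
        return {"whole": seq}
    return {"left": seq[:END_WINDOW], "right": seq[-END_WINDOW:]}

def tel_linked_ends(read_pairs, scaffolds):
    """Nonredundant, sorted list of (scaffold, end) linked to a Tel-read mate."""
    linked = set()
    for read, mate in read_pairs:
        if not is_tel_read(read):
            continue
        tags = internal_tags(mate)
        probes = tags + [revcomp(t) for t in tags]
        for name, seq in scaffolds.items():
            for end, window in scaffold_end_windows(seq).items():
                if any(p in window for p in probes):  # at least one exact match
                    linked.add((name, end))
    return sorted(linked)

# Toy data (hypothetical): one 12-kb scaffold and two read pairs.
rng = random.Random(0)
scaffold = "".join(rng.choice("ACGT") for _ in range(12_000))
tel_read = "ACGT" * 25 + TELOMERE_12MER * 2   # contains the telomeric 12-mer
plain_read = "ACGT" * 50                      # no telomeric repeat
pairs = [(tel_read, scaffold[200:700]),       # mate lies near the left end
         (plain_read, scaffold[-700:-200])]   # mate near right end, not a Tel-read
result = tel_linked_ends(pairs, {"scaf1": scaffold})
print(result)
```

Collecting hits in a set plays the role of the clustering step in the text: each scaffold end is reported once however many Tel-read mates match it.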

A second method matched previously identified MIC DNA sequences flanking cloned Cbs junctions to scaffold ends (see Figure 2). Telomeres are added within 30 bp of the Cbs element. Thus, if Cbs-adjacent sequence from MIC DNA can be aligned with a MAC scaffold end, the end can be inferred to be telomere-linked. BLASTN searches were carried out with the “no filter” option because very AT-rich sequence was being compared.

A third method involved PCR walking from scaffold open ends to telomeres. Primers designed from scaffold ends were used in combination with the generic 14-nt telomere primer, 5′-CCCCAACCCCAACC-3′. The authenticity of each PCR product was confirmed by sequencing.
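The orientation of the generic telomere primer can be checked by taking its reverse complement. In the sketch below, the repeat unit GGGGTT is inferred from the 12-mer GGGGTTGGGGTT quoted for the first method, and the helper name is ours.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

TELOMERE_PRIMER = "CCCCAACCCCAACC"   # generic 14-nt telomere primer, from the text
G_STRAND_REPEAT = "GGGGTT"           # repeat unit inferred from the 12-mer GGGGTTGGGGTT

# The primer anneals to the strand complementary to itself, so its reverse
# complement should read as a run of G-strand telomeric repeats.
target = revcomp(TELOMERE_PRIMER)
g_strand = G_STRAND_REPEAT * 4

print(target)               # sequence the primer base-pairs with
print(target in g_strand)   # whether it lies within a run of GGGGTT repeats
```

The primer is thus the C-rich complement of the telomeric repeat, which is why pairing it with a scaffold-specific primer amplifies the interval between the scaffold end and the telomere.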

Cloning and sequencing RAPDs and sizing their associated MAC chromosomes.

Conditions and reagents for RAPD PCR were as in [171]. The 10-mer primers were from Operon Technologies. The polymorphic RAPD PCR products were size-fractionated by electrophoresis in a 1.5% agarose gel. Polymorphic bands were excised and the DNA was extracted with a QIAquick gel extraction kit (Qiagen, Chatsworth, California, United States). The DNA was reamplified using the same PCR conditions and primer combination initially used to detect the polymorphism. Amplified fragments were cloned into the pCR2.1-TOPO vector (Invitrogen) according to the manufacturer’s directions. Insert-containing clones, identified as white colonies, were screened for insert size by colony PCR as in [172]. The authenticity of each correctly sized insert was confirmed by hybridization to a Southern blot of RAPD products from a panel of ten Tetrahymena strains in which the alleles of the RAPD locus were meiotically segregating [40].

Plasmid DNA was isolated using a QIAprep Miniprep kit (Qiagen, Valencia, California, United States), and inserts were sequenced using the Big Dye Terminator Cycle-Sequencing-Ready Reaction kit (PE Applied Biosystems, Foster City, California, United States). Nucleotide sequences were determined using an ABI 310 Genetic Analyzer. Insert sequences were then searched against the assemblies using BLASTN.

High-molecular-weight DNA was prepared by embedding live cells from strain SB210 in agarose plugs and lysing them using a modification of Birren and Lai [173]. The DNA plugs were inserted into the wells of a 1% Pulsed Field Certified Agarose gel (Bio-Rad, Hercules, California, United States) in 1× TAE buffer. Preliminary sizing of MAC chromosomes was obtained from gels run using the following conditions: 30 h at 6 V/cm with a 60- to 120-s switch time ramp at an included angle of 120°, 1× TAE recirculated at 10 °C. Running conditions were varied when the above conditions did not provide adequate resolution in the size range of a particular MAC chromosome (E. P. Hamilton, unpublished data). The DNA in the gel was acid-depurinated, neutralized, and transferred to a positively charged nylon membrane by downward alkaline transfer (CHEF-DR III Instruction Manual and Applications Guide; Bio-Rad). After blotting, the DNA was crosslinked to the membrane using a Bio-Rad GS Gene Linker. 32P-labeled probes were made from the PCR products obtained from each RAPD clone. Methods for making probes, Southern hybridization, and autoradiography were as in [40].

cDNA library construction and sequencing.

cDNA libraries were generated from cells in either the conjugative or vegetative stages of the life cycle. For the conjugative library, cells from a mating between strains CU428 and B2086 were harvested at 3, 6, and 10 h after mixing, and RNA was purified using TRIzol. PolyA+ RNA was isolated and cDNA was generated by Amplicon Express (Pullman, Washington, United States). Inserts were cloned into EcoRI and XhoI sites in pBluescript IISK+ (Stratagene, La Jolla, California, United States) and had an average size of 1.4 kb. Clones were picked at random and sequenced from the 5′ end of the transcript using the T3 primer. For the vegetative library, which was made by DNA Technologies (Gaithersburg, Maryland, United States), CU428 cells were harvested in exponential growth and RNA was purified using TRIzol. PolyA+ mRNA was isolated using oligo(dT) cellulose, cDNA was generated, and inserts were cloned into the EcoRV and NotI sites of the pcDNA3.1(+) vector (Invitrogen). Clones were picked at random and sequenced from the 5′ end using the custom pcDNA(−48) primer. All sequences were submitted to the dbEST division of GenBank, to the Taxonomically Broad EST Database (TBestDB) at http://tbestdb.bcm.umontreal.ca/searches/login.php, and to TIGR’s Tetrahymena Gene Index at http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=t_thermophila. Subsequent analyses used comparisons of the conjugative sequences with all vegetative sequences including those in GenBank not generated at TIGR.

Functional ncRNA analysis.

Most ncRNA annotations (Table S6) were generated using covariance model (CM) scans [174]. Transfer RNA annotations are those provided by the CM-based tRNAscan-SE program [175] run with default parameters. Most other scans were based on CMs defined by the Rfam database [176,177] (release 7.0, March 2005; 503 families). With a few exceptions, we used rigorous filters [178] built from the Rfam models to identify exactly those sequences that match the Rfam models with scores at or above Rfam’s family-specific “gathering” cutoff. One exception was RF00005 (tRNA), as mentioned above. Another exception was RF00012, the U3 small nucleolar RNA, for which the Rfam model found no hits. Instead, we manually added one known Tetrahymena U3 sequence [179] to the Rfam seed alignment, built a CM from it, and rescanned the genome, finding the four U3 sequences reported in Table S6C. The third class of exceptions consisted of the 44 Rfam families using the “local alignment” feature of CMs. These families were scanned using ML-heuristic filters [180], with a scan threshold chosen for each such family such that approximately 1% of the genome was scored by the CM. This setting generally shows good sensitivity but, unlike the rigorous scans above, is not guaranteed to find all sequences that match the Rfam model. Hits against the Rfam T_box (RF00230), group I self-splicing intron (RF00028), and ctRNA_pND324 (RF00238, involved in bacterial plasmid copy control) families all appeared implausible and were also unexpected by phylogenetic criteria. Hits against Rfam small nucleolar RNAs (RF00086, RF00133, RF00309) also appeared to be false positives, as were most hits to the iron response element (RF00037) and selenocysteine insertion sequence (RF00031) families. Other families not discussed here or in Table S6 yielded no hits above threshold. See http://www.cs.washington.edu/homes/ruzzo/papers/Tthermophila for full details about the ncRNA scans.
It should be noted that our annotation approach may be prone to reporting ncRNA pseudogenes and that its accuracy may be affected by the high AT content of the genome.

Protein-coding gene finding and coding region analysis.

The gene finder TIGRscan ([181], since renamed GeneZilla) was trained for T. thermophila using a two-phase bootstrapping process [182], due to the dearth of curated training data available at the time. In the first round of training (termed “long-ORFs”), all parameters were estimated from a set of 193 full-length cDNAs from the apicomplexan P. falciparum (including surrounding regions from the genomic sequence; 1.6 Mb total) except for the exon state, which was trained on 2,130 nonoverlapping, long ORFs (each at least 3,000 bp in length). The default polyadenylation signal state and TATA-box state for this gene finder utilize human TRANSFAC weight matrices [183]; these were not modified. The gene finder was then used to predict genes in the raw T. thermophila genomic sequence, and the predictions were used to bootstrap the parameter estimation during the second round of training (termed “hybrid”). Sixty curated T. thermophila genes that became available during the second round of training were analyzed and their coding statistics were used to improve the exon state by averaging with the original long-ORF statistics, appropriately weighted to eliminate length bias. Exon length distributions were estimated from the 60-gene set, with appropriate smoothing. Interpolated and noninterpolated Markov chains [184] were utilized by the content states, with the order of dependency (3rd for exons and introns, 0th for intergenic, and 1st for UTR) selected so as to optimize prediction accuracy on the 60-gene set. Splice site and start/stop codon states were re-trained from pooled data consisting of the 60 curated genes and the original P. falciparum training data, using an 80%:20% T. thermophila/P. falciparum weighting to mitigate the effects of overtraining due to small sample sizes in the 60-gene set. Weight matrices utilized by the latter states were reduced to approximately 22 bp when it was noticed that longer matrices interfered with the prediction of short introns.
The “hybrid” and “long-ORFs” parameterizations were tested on a set of 300 partial genes inferred from ESTs that were assembled against the chromosomes using the PASA program [52]. The “hybrid” parameterization was chosen because it was about three times more accurate at the exon level than “long-ORFs” (see Table S16).

Multivariate analysis of codon usage was performed with the codonW package (http://codonw.sourceforge.net). Correspondence analysis of relative synonymous codon usage values was carried out to examine the major source of codon usage variation. Amino acid composition of the predicted aggregate proteome was compared with the corresponding data downloaded from dictyBase for the slime mold D. discoideum and from Ensembl for Homo sapiens.

To find candidate tandem gene duplicates, we analyzed pairwise alignments between neighboring genes using BLASTP. An all-versus-all BLASTP search was performed using all Tetrahymena gene-encoded proteins, requiring a maximum E-value of 1e−20, and reporting the best 20 matches. Matching genes found at adjacent genome locations were chained together and reported as candidate tandem gene arrays, allowing a total of at most two nonmatching genes to intervene between matching genes in a single array.
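The chaining step can be sketched as follows. This is a minimal illustration under one reading of the rule, not the pipeline used for the analysis: the function name `tandem_arrays`, the input layout (an ordered gene list per scaffold plus a dictionary of BLASTP matches already filtered at the E-value cutoff), and the decision to count intervening genes against a single per-array budget are all assumptions made for the example.

```python
from typing import Dict, List, Set


def tandem_arrays(order: List[str],
                  hits: Dict[str, Set[str]],
                  max_gaps: int = 2) -> List[List[str]]:
    """Chain neighboring genes that match each other into candidate arrays.

    order:    gene IDs in their order along one scaffold (hypothetical input).
    hits:     gene ID -> set of gene IDs it matched, already E-value filtered.
    max_gaps: total number of nonmatching genes tolerated inside one array.
    """
    def matches(a: str, b: str) -> bool:
        # Treat a hit in either direction as a match.
        return b in hits.get(a, set()) or a in hits.get(b, set())

    arrays = []
    i = 0
    while i < len(order):
        members = [order[i]]
        gaps_used = 0   # nonmatching genes already enclosed by the array
        pending = 0     # nonmatching genes seen since the last member
        last = i        # index of the last gene added to the array
        j = i + 1
        while j < len(order):
            if any(matches(order[j], m) for m in members):
                gaps_used += pending
                pending = 0
                members.append(order[j])
                last = j
            else:
                pending += 1
                if gaps_used + pending > max_gaps:
                    break
            j += 1
        if len(members) > 1:
            arrays.append(members)
        # Simplification: genes skipped inside an array do not seed new arrays.
        i = last + 1
    return arrays


order = ["g1", "g2", "x1", "g3", "y1", "y2", "y3", "g4"]
hits = {"g1": {"g2", "g3", "g4"}, "g2": {"g1"}, "g3": {"g1"}}
print(tandem_arrays(order, hits))
```

In this toy scaffold, g4 matches g1 but is left out of the array because three nonmatching genes separate it from g3, which would push the array past its budget of two.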

A Lek clustering algorithm [169] was applied for paralogous gene family classification of the predicted proteins in the T. thermophila genome. All predicted proteins were searched with BLASTP against each other. Links were established between genes at an E-value cutoff of 1e−20. Lek similarity scores, which were defined as the number of BLASTP hits shared by any pair of proteins divided by the combined number of hits for either of the two genes, were calculated for all pairs of proteins. The links for which the Lek similarity scores were above a cutoff of 0.66 were used to build gene family clusters by a single-linkage clustering algorithm. Biological function roles were assigned to the gene families based on the top BLASTP hits for individual genes in each family against a nonredundant protein database.
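As an illustration of the score and the clustering step, the sketch below reads “combined number of hits” as the size of the union of the two hit lists, so that the score is a Jaccard-style ratio; that reading, the inclusion of self-hits in each list, and the names `lek_score` and `lek_clusters` are assumptions for the example rather than details taken from [169].

```python
from itertools import combinations
from typing import Dict, List, Set


def lek_score(hits_a: Set[str], hits_b: Set[str]) -> float:
    """Shared hits divided by combined hits (union) of two proteins."""
    union = hits_a | hits_b
    return len(hits_a & hits_b) / len(union) if union else 0.0


def lek_clusters(hits: Dict[str, Set[str]], cutoff: float = 0.66) -> List[Set[str]]:
    """Single-linkage clusters over links whose Lek score exceeds the cutoff."""
    parent = {p: p for p in hits}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(hits, 2):
        linked = b in hits[a] or a in hits[b]   # a BLASTP link exists
        if linked and lek_score(hits[a], hits[b]) > cutoff:
            parent[find(a)] = find(b)           # merge the two clusters

    groups: Dict[str, Set[str]] = {}
    for p in hits:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# Toy hit lists (already filtered at the E-value cutoff; self-hits included).
hits = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "C"},
    "C": {"A", "B", "C", "D"},
    "D": {"C", "D"},
    "E": {"E"},
}
print(sorted(sorted(c) for c in lek_clusters(hits)))
```

Here C and D hit each other but share only two of four combined hits (score 0.5), so D stays outside the A/B/C family. The all-pairs loop is quadratic; on a full proteome one would iterate over the BLASTP links only.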

Organelle-derived genes and APIS.

Searches for plastid and mitochondrial related genes were performed using the APIS program. APIS (J. H. Badger, unpublished data) is a system that automatically generates and summarizes phylogenetic trees for each gene in a genome. It is implemented as a series of Ruby scripts, and the results are viewable on an internal Web server which allows the user to explore the data and results in an interactive manner. APIS obtains homologs by comparing each query protein against a database of proteins from complete genomes, and extracting the full-length sequences of homologs with E-values less than 1e−10. The homologs are then aligned by MUSCLE [185] and bootstrapped neighbor-joining trees are produced using QuickTree [186]. As QuickTree (unlike most programs) produces bootstrapped trees with meaningful branch lengths, the trees are then midpoint rooted. Then a taxonomic analysis is performed of the proteins that are neighbors in the tree with the query protein. This analysis makes use of the NCBI taxonomy assigned to the other proteins in the tree. For each taxonomic level (e.g., kingdom, phylum, class), the query protein is assigned to a bin. If in the tree the query protein is within a clade of sequences that are all from group X (for the taxonomic level being examined) then the query protein is placed in a bin labeled “contained within group X.” If the query protein branches next to (but not within) a clade of sequences from the same group, it is placed in a bin labeled “outgroup of X.” If the neighbors of the query sequence are in multiple groups, no binning is done for that taxonomic level.
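The binning rule for a single taxonomic level can be illustrated with a small sketch. It is written in Python rather than Ruby and is not the APIS code, which is unpublished. The nested-tuple tree representation, the function names, and the way “within” is distinguished from “next to” (the query counts as contained only if the next clade outward is also entirely group X) are assumptions chosen for the example.

```python
def leaves(node):
    """All leaf names under a node; a leaf is a string, a clade is a tuple."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def path_to(node, query):
    """Clades from the root down to the query leaf, or None if absent."""
    if node == query:
        return [node]
    if isinstance(node, str):
        return None
    for child in node:
        sub = path_to(child, query)
        if sub is not None:
            return [node] + sub
    return None


def single_group(clade, query, group_of):
    """The one group shared by all non-query leaves of a clade, else None."""
    groups = {group_of[name] for name in leaves(clade) if name != query}
    return groups.pop() if len(groups) == 1 else None


def bin_query(tree, query, group_of):
    """Bin label for the query at one taxonomic level, or None (no binning)."""
    path = path_to(tree, query)
    if path is None or len(path) < 2:
        return None
    inner = single_group(path[-2], query, group_of)   # immediate neighbors
    if inner is None:
        return None                                   # neighbors in multiple groups
    # Assumed reading: "within" requires the enclosing clade to be all X too.
    if len(path) >= 3 and single_group(path[-3], query, group_of) == inner:
        return f"contained within {inner}"
    return f"outgroup of {inner}"


group_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
nested = ((("q", "a1"), "a2"), ("b1", "b2"))
sister = (("q", ("a1", "a2")), ("b1", "b2"))
mixed = (("q", ("a1", "b1")), "a2")
print(bin_query(nested, "q", group_of))
print(bin_query(sister, "q", group_of))
print(bin_query(mixed, "q", group_of))
```

Running the same function once per taxonomic level, with `group_of` rebuilt from the taxonomy at that level, gives one bin per level for each query protein.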

Candidates for mitochondrially derived genes were separately identified by BLASTP searches using known mitochondrial proteins as queries [187,188]. Phylogenetic trees were then constructed for individual candidates in the context of all completely sequenced genomes and representatives of mitochondria. Genes whose closest neighbors were exclusively α-proteobacteria and/or mitochondria were classified as possibly mitochondrion derived.

Analysis of repetitive DNA and TEs.

The location and characterization of tandem minisatellite and microsatellite repeats were done using Tandem Repeats Finder [189], using the default parameter values. The location, length, period size, %GC, and consensus sequence of each repeat were extracted for all scaffolds and listed with the scaffold number and size. Vmatch (http://www.vmatch.de) was used to search for repeats that are at least 50 bp long and 100% identical (Table S17). We note that repeats that are larger than the average insert size of our libraries would not be able to be uniquely placed into any assembly by the Celera Assembler and thus do not show up in our analysis.

The T. thermophila genome was searched against two sets of TEs using BLASTN and/or TBLASTN [190], with default parameters and an E-value cutoff of 1e−5. One of the TE sets consisted of 12 complete or partial ciliate TEs, namely Tec1, Tec2, and Tec3 from Moneuplotes crassus, TBE1 from O. fallax, and REP1, REP2.2, REP3, REP6, TIE1, TIE2, TIE3, and Tlr from T. thermophila [90,91,191,192]. The other TE set consisted of 44 representative elements of the transposon superfamily mariner/Tc1/IS630 [192], including members of the mariner, Tc1, DD39D (plant), DD37D (nematodes and insects), DD37D (mosquitoes), Ant1/Tec, and Pogo families. In addition, the genome was scanned for homology to TE-encoded ORFs using PSI-TBLASTN [190]. Briefly, a reference ORF from each major family of autonomous transposons and retrotransposons was searched against the nonredundant protein database using BLAST-PGP with two iterations, generating a TE ORF family-specific profile. Each reference TE ORF and corresponding family profile were searched against the genomic sequence using PSI-TBLASTN, and all matches with E-value at most 1e−5 were captured for subsequent analysis. Finally, a few scaffolds with putatively complete transposases belonging to the mariner/Tc1/IS630 superfamily were further investigated for the presence of the inverted terminal repeats (ITRs) that typically flank these elements. Identification of paired ITRs was done using Owen [193] and searches were done against known consensus ITR sequences of mariner and Tc1 elements to find individual ITRs.

Analysis of functional categories with gene family expansions.

Protein kinase genes were identified by comparison of peptide predictions to a set of protein kinase profile hidden Markov models [104] and by BLAST against divergent kinase sequences. A small number of gene predictions were split or fused to adjacent predictions based on presence of split or multiple kinase domains. Kinases were classified by comparison of kinase domain sequences to a set of group-, family-, and subfamily-specific hidden Markov models as well as by BLAST-based clustering of T. thermophila and previously classified kinases.

Predicted protein sequences were searched against a curated database of membrane transport proteins [113] for similarity to known or putative transport proteins using BLASTP. All proteins with significant hits (E-value less than 0.001) were collected and searched against the NCBI nonredundant protein and Pfam databases [194]. Transmembrane protein topology was predicted by TMHMM [195]. A Web-based interface was implemented to facilitate the annotation processes, which incorporates number of hits to the transporter database; BLAST and hidden Markov model search E-value and score; number of predicted transmembrane segments; and the description of top hits to the nonredundant protein database (http://www.membranetransport.org) [113,196].

A total of over 30,000 sequences of characterized and predicted proteases were obtained from the Merops database (http://www.merops.ac.uk, release 7.00) [119]. These sequences were searched against the T. thermophila predicted protein sequences using BLASTP with default settings and an E-value cutoff of less than 1e−10 for defining protease homologs. Partial sequences (less than 80% of full-length) and redundant sequences were excluded. The domain/motif organization of predicted T. thermophila proteases was revealed by an InterPro search. For each putative protease, the known protease sequence or domain with the highest similarity was used as a reference for annotation; the catalytic type and protease family were predicted in accordance with the classification in Merops, and the enzyme was named in accordance with SWISS-PROT enzyme nomenclature (http://www.expasy.ch/cgi-bin/lists?peptidas.txt) and literature.

Tubulin superfamily genes were identified by a BLASTP search using T. thermophila α-tubulin Atu1p as the query. Twenty-one candidate predicted ORFs were identified, but two showed only moderate sequence similarity to either the amino- (TTHERM_00834920) or the carboxyl- (TTHERM_00896110) terminal halves of α- or β- tubulin and were not considered further. The 19 remaining were aligned with representative tubulins from other organisms and a neighbor-joining tree constructed using default settings of ClustalX (version 1.81) with 1,000 bootstrap runs. A prokaryotic tubulin ortholog, Escherichia coli FtsZ, was used as the outgroup (see Figure 7).

Using dynein subunit sequences obtained in the green alga C. reinhardtii or in other species when appropriate, we searched the T. thermophila MAC genome for orthologous sequence with TBLASTN. Candidate sequences were aligned with the sequences available in the databases of dynein subunits characterized in other experimental systems. Exon-intron borders were first approximated using the characteristics of the 64 introns previously experimentally determined in three dynein heavy chains, DYH1, DYH2, and DYH4. The 64 T. thermophila introns are AT rich (average 88%), are bounded by 5′-GT and AG-3′ and are relatively short (average 80 nucleotides; range, 50 to 332). The exon-intron borders and the expression of each gene were confirmed by RNA-directed PCR and, if necessary, sequencing of the amplified RT-PCR product. The verification of the exon-intron organizations of most of the heavy chains has not been completed.
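A first-pass scan for candidate introns based on these characteristics might look like the sketch below. The length range (50 to 332 nucleotides) and the GT/AG boundaries come from the text; the AT-fraction threshold of 0.80, the function name, and the exhaustive pairing of every GT with every downstream AG are assumptions for illustration, and such a scan only proposes candidates that still need the RT-PCR confirmation described above.

```python
import re


def candidate_introns(seq, min_len=50, max_len=332, min_at=0.80):
    """Return (start, end, AT fraction) for GT...AG spans that look intron-like.

    Coordinates are 0-based and half-open. min_at is a hypothetical threshold
    chosen below the reported 88% average AT content.
    """
    seq = seq.upper()
    found = []
    # The lookahead pattern finds every GT, including overlapping contexts.
    for start in (m.start() for m in re.finditer("(?=GT)", seq)):
        last_end = min(start + max_len, len(seq))
        for end in range(start + min_len, last_end + 1):
            if seq[end - 2:end] != "AG":
                continue
            intron = seq[start:end]
            at = (intron.count("A") + intron.count("T")) / len(intron)
            if at >= min_at:
                found.append((start, end, round(at, 3)))
    return found


at_rich = "CCCC" + "GT" + "A" * 50 + "AG" + "CCCC"
gc_rich = "CCCC" + "GT" + "C" * 50 + "AG" + "CCCC"
print(candidate_introns(at_rich))
print(candidate_introns(gc_rich))
```

The AT-rich toy sequence yields one 54-nucleotide candidate, while the GC-rich one yields none even though it has the same GT and AG boundaries.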

Peptide sequence of Rab1A from H. sapiens was used to query T. thermophila gene predictions using BLASTP. Candidate Rab homologs were screened to include predicted proteins with complete Rab domains. These sequences were individually used in BLASTP searches of GenBank to confirm that Rab proteins from another species were the closest match. The minimum E score cut-off was 5e−13, but the majority of homologs scored better than 1e−30. The top scoring Rab1 homolog from T. thermophila (TTHERM_00316280) was used in an additional BLASTP search of the T. thermophila genome to confirm that all Rab homologs were identified by the initial query. Homologs of other GTPases in the Rabl, Ral, Rap, Ras, Rho, and Arf families began to appear along with the lower scoring Rab homologs and were discarded from the set. Rab protein sequences from H. sapiens (Ensembl database), Drosophila melanogaster (Flybase), and S. cerevisiae (Saccharomyces Genome Database), along with those identified as described above from T. thermophila, were aligned using ClustalX. The alignment was refined by eye and gaps removed. The tree in Figure S7 was generated using the neighbor-joining module in Phylip 3.6. Trees constructed using maximum-likelihood and parsimony methods largely corroborated this topology. T. thermophila Rab homologs associated with clades of previously identified Rabs were given putative names where consistent BLASTP results were evident and are arranged in Table S15 according to functional groups. Preliminary annotations from the TGD were queried to identify predicted coat protein homologs. Others were identified in queries with peptide sequence from D. melanogaster homologs. T. thermophila homologs were used in BLASTP queries of GenBank to confirm annotations. Further analysis of AP subunits, clathrin, and dynamin-related proteins is found in [96].

Sequence availability.

All of the sequences, assemblies, and gene predictions can be downloaded from the TIGR ftp site (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila). The sequence reads and traces can be downloaded from the NCBI trace archive at ftp://ftp.ncbi.nih.gov/pub/TraceDB/tetrahymena_thermophila. Assemblies, sequence reads, and gene predictions can be searched using multiple similarity search methods at the TIGR, TGD, and NCBI Web sites. Sequences are also available in GenBank (see below).

Supporting Information

Figure S1. Nucleotide Composition

(A) Scaffolds larger than 1 Mb were sorted by size and concatenated to make a pseudo molecule. Statistics of nucleotide composition were calculated for 2,000 bp sliding windows with a shift length of 1,000 bp. Yellow, GC skew; blue, GC%; purple, χ² score. The green lines delimit the scaffolds (long) or contigs within each scaffold (short).

(B) Analysis of three T. thermophila scaffolds of diverse size. Red boxes, genes on forward strand; green boxes, genes on reverse strand; blue, χ² score; orange, GC%; brown, GC skew; salmon, AT skew. The vertical light gray lines delimit contigs within each scaffold. Scaffold sizes: 8254645, 1,076 kb; 8254654, 510 kb; 8254072, 37.3 kb.

(246 KB PDF)
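The windowed statistics in panel (A) can be sketched as below, using the 2,000-bp window and 1,000-bp shift given in the legend. The function name is hypothetical, GC skew is computed with the conventional definition (G − C)/(G + C), which is assumed rather than stated in the legend, and the χ² score is omitted because its definition is not given here.

```python
def window_stats(seq, window=2000, step=1000):
    """Return (start, GC percent, GC skew) for each full sliding window."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        g, c = w.count("G"), w.count("C")
        gc_pct = 100.0 * (g + c) / window
        # Conventional GC skew; defined as 0 when the window has no G or C.
        gc_skew = (g - c) / (g + c) if g + c else 0.0
        rows.append((start, gc_pct, gc_skew))
    return rows


toy = "G" * 1000 + "A" * 1000 + "C" * 1000
for row in window_stats(toy):
    print(row)
```

AT skew would follow the same pattern with A and T counts in place of G and C.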

Figure S2. Gene Density Distribution

Using scaffolds larger than 100 kb, the percentage of predicted gene coding sequence was calculated within 10-kb windows. For the overall gene density (black bars), a sliding 10-kb window was applied at 2-kb intervals. Gray bars represent gene density in the 10-kb adjacent to each telomere.

(92 KB PDF)
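The overall density calculation can be sketched as follows, with the 10-kb window and 2-kb interval from the legend. The function name, the half-open (start, end) coordinate convention, and the assumption that coding intervals do not overlap one another are illustrative choices, not details from the analysis.

```python
def coding_density(scaffold_len, cds, window=10_000, step=2_000):
    """Percentage of each window covered by coding sequence.

    cds: list of (start, end) coding intervals, 0-based and half-open,
         assumed not to overlap one another.
    """
    out = []
    for start in range(0, scaffold_len - window + 1, step):
        end = start + window
        # Sum the part of every coding interval that falls inside the window.
        covered = sum(max(0, min(e, end) - max(s, start)) for s, e in cds)
        out.append((start, 100.0 * covered / window))
    return out


cds = [(0, 5_000), (12_000, 13_000)]
for row in coding_density(14_000, cds):
    print(row)
```

The telomere-adjacent values (gray bars) correspond to evaluating a single window at each end of a telomere-capped scaffold instead of sliding along its length.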

Figure S3. Intron Size Distribution

Comparison of the percentage of introns in various size classes for both ab initio predicted genes (gray bars) and introns confirmed by EST sequencing (black bars).

(17 KB PDF)

Figure S4. Expression of tRNA and Other ncRNAs

(A) tRNA charging and expression. Total RNA was harvested from T. thermophila in log-phase growth (lanes 1 and 2) or after resuspension in 10 mM Tris starvation buffer for the times indicated. Total RNA samples were resolved by acid/urea acrylamide gel electrophoresis and transferred to nylon membrane; the same total RNA sample either untreated or deacylated at alkaline pH was used for lanes 1 and 2. Probing was performed using end-radiolabeled oligonucleotides specific for the tRNA of interest.

(B) Expression levels of ncRNAs under various conditions. Total RNA was harvested from T. thermophila under the growth or development conditions indicated, resolved, transferred, and probed as in (A). As an internal control for even loading, the same blot was hybridized to detect tRNA-Sec and SRP RNA (RNA PolIII transcripts found predominantly in the cytoplasm and involved in translation) and also to U1 and U2 snRNAs (RNA PolII transcripts found predominantly in the nucleus and involved in mRNA splicing).

(420 KB PDF)

Figure S5. Distribution of Repeat Content versus Scaffold Size

Orange points represent scaffolds that have been capped with telomeres at both ends.

(30 KB PDF)

Figure S6. Expansion of the Polo Kinase Family in T. thermophila Compared with Selected Eukaryotes

Neighbor-joining tree built from ClustalW alignment of polo kinase domains. Species abbreviations: Hs, H. sapiens; Dm, D. melanogaster; Ce, Caenorhabditis elegans; Sc, S. cerevisiae; Dd, D. discoideum; Tt, T. thermophila. Note that T. thermophila has multiple members of both the polo and sak subfamilies, and that even within the T. thermophila–specific cluster, sequences are as divergent as orthologs from vertebrates and lower metazoans. The bar indicates scale of average substitutions per site.

(71 KB PDF)

Figure S7. Phylogenetic Analysis of Rabs

Unrooted neighbor-joining tree for Rab GTPases. Bootstrap values over 40% (from 100 replicates) are indicated near corresponding branches. Predicted T. thermophila genes are in bold. Other Rabs are from H. sapiens (Hs), D. melanogaster (Dm), and S. cerevisiae (Sc). Proposed Rab families [157] are shown in colored blocks. Asterisks indicate Rabs for which there is functional evidence (**) or at least localization data (*) consistent with their groupings. T. thermophila genes cluster with the members of each Rab family except VII and IV (not shown in a box). There are three clades comprised exclusively of T. thermophila gene predictions (clades I, II, and III) shown in dark gray boxes.

(39 KB PDF)

Table S1. Genomic DNA Libraries

(28 KB DOC)

Table S2. Statistics on Chromosome Assemblies and Satellite Repeats

(52 KB DOC)

Table S3. Scaffolds Capped by Telomeres

(352 KB DOC)

Table S4. Matches of RAPD DNA Polymorphisms to Scaffolds

(167 KB DOC)

Table S5. T. thermophila ESTs, including Available GenBank Entries

(30 KB DOC)

Table S6. ncRNAs

(A) 5S.

(B) tRNA.

(C) Other ncRNAs.

(D) tRNA gene IDs.

(1.0 MB DOC)

Table S7. Genes Predicted to Be Highly Expressed on the Basis of Codon Usage Bias

(388 KB DOC)

Table S8. Likely Mitochondrion-Derived Genes from the T. thermophila Macronuclear Genome

(114 KB DOC)

Table S9. Scaffolds with Similarity to Members of the mariner/Tc1/IS630 Superfamily

(73 KB DOC)

Table S10. Recent Gene Duplications

(1.9 MB DOC)

Table S11. Expanded Versions of Tables 5 through 8, including TIGR and GenBank IDs for All the Identified Genes

(A) Kinases.

(B) Membrane transporters.

(C) Proteases.

(D) Cytoskeletal related.

(3.6 MB DOC)

Table S12. Human Disease Genes with Orthologs in T. thermophila, but Not the Yeast S. cerevisiae

(90 KB DOC)

Table S13. Dynein Subunit Genes in T. thermophila

(134 KB DOC)

Table S14. Membrane Traffic Component Homologs in T. thermophila

(59 KB DOC)

Table S15. Rab Homologs in the T. thermophila Genome Assembly

(159 KB DOC)

Table S16. Testing Different Gene Finder Parameterizations

(25 KB DOC)

Table S17. The 50 Longest 100% Identical Repeats

(93 KB DOC)

Accession Numbers

The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers for the T. thermophila genes are TTHERM_00047660, 00141160, 00279820, 00486500, 00522580, and 00823430 and for three dynein heavy chains, DYH1, DYH2, and DYH4, are AF346733, AY770505, and AF072878, respectively. The sequence contigs (AAGF01000001 to AAGF01002955), the scaffold assemblies (CH445395 to CH445797 and CH670346 to CH671913), and the gene predictions (EAR80512 to EAS07932) are available from GenBank. The Gene Identification numbers in Figure 7 obtained from JGI Chlamy v2.0 (http://genome.jgi-psf.org/chlre2/chlre2.home.html) are Ec_FtsZ, 16128088; Dm_alpha-1, 135396; Hs_alpha-1, 5174477; Cr_alpha-1, 135394; Tb_alpha, 135440; Sc_alpha, 1729835; Pt_alpha, 1460090; Dm_beta-1, 158739; Hs_beta-1, 135448; Cr_beta, 8928401; Tb_beta, 135500; Pt_beta-1, 417854; Sc_beta, 1174608; Dm_gamma-1, 45644955; Hs_gamma-1, 31543831; Sc_gamma, 1729859; Cr_gamma, 8928436; Pt_delta, 10637981; Hs_delta, 50592998; Cr_delta, 75277286; Tb_delta, 13508430; Hs_epsilon, 7705915; Pt_epsilon, 18477270; Tb_epsilon, 259797; Xl_eta, 4266842; Pt_eta, 9501681; Tb_zeta, 7341314; Pt_iota, 18478276; Pt_theta, 18478274; Pt_kappa, 32812838; and Cr_epsilon (C_460065). The Ensembl Gene ID (http://www.ensembl.org) for Rab1A from H. sapiens is ENSG00000138069.

Acknowledgments

We would like to acknowledge the Tetrahymena research community and the members of our Tetrahymena Scientific Advisory Board for advice, support, encouragement, and assistance. In addition, we would like to specifically acknowledge many people for assistance: John Gill (sample tracking); Hean Koo (contaminant identification and trace archive and EST submission); Shannon Smith, Susan van Aken, and William Nierman (library construction); Sam Angiuoli (Web and BLAST page maintenance); Jeff Shao (database construction); Jessica Vamathevan (initial work on genome closure); Tamara Feldblyum, Terry Utterback, and the staff at the J. Craig Venter Institute’s Joint Technology Center (sequencing); Lauren Smith and Jyoti Shetty (fosmid construction); Malcolm Gardner (advice); Martin Shumway (general software engineering support); Owen White (general informatics support); Leslie Bisignano and Lynn McKenna (grants support); Aimee Turner (financial operations); Tinu Akinyemi (administrative support); and Claire Fraser (for supporting the scientific research within TIGR).

Author contributions. JAE coordinated the project. JAE, RSC, EPH, and EO wrote and edited the majority of the manuscript. JAE, RSC, MW, DW, JHB, and MT performed multiple bioinformatics analyses. MT, JRW, PA, MF, RKS, and BJH coordinated the annotation. KMJ and LJT carried out genome closure. ALD and SLS generated and analyzed genome assemblies. JCS, KMK, and LS analyzed mobile DNA elements. WHM generated gene models. QR conducted analyses of membrane transporters. JMC, JG, and REP generated and analyzed ESTs. GM analyzed protein kinases. NCE and APT analyzed membrane trafficking. DJA and DEW analyzed dyneins. YW and HC analyzed proteases. KC, BAS, SRL, WLR, KW, and ZW analyzed ncRNA. DW, JG, MAG, JF, and CCT analyzed cytoskeletal associated proteins. PJK, RFW, NJP, and JHB searched for plastid-derived genes. JMC, NAS, and CJK built TGD. CdT, HFR, SCW, and RAB performed the RAPD analyses. EPH, EO, SLS, JAE, and MW examined genome structure.

References

  1. Collins K, Gorovsky MA (2005) Tetrahymena thermophila. Curr Biol 15: R317–R318.
  2. Nanney DL, Simon EM (2000) Laboratory and evolutionary history of Tetrahymena thermophila. Methods Cell Biol 62: 3–25.
  3. Zaug AJ, Cech TR (1986) The intervening sequence RNA of Tetrahymena is an enzyme. Science 231: 470–475.
  4. Blackburn EH, Gall JG (1978) A tandemly repeated sequence at the termini of the extrachromosomal ribosomal RNA genes in Tetrahymena. J Mol Biol 120: 33–53.
  5. Yao MC, Yao CH (1981) Repeated hexanucleotide C-C-C-C-A-A is present near free ends of macronuclear DNA of Tetrahymena. Proc Natl Acad Sci U S A 78: 7436–7439.
  6. Greider CW, Blackburn EH (1985) Identification of a specific telomere terminal transferase activity in Tetrahymena extracts. Cell 43: 405–413.
  7. Brownell JE, Zhou J, Ranalli T, Kobayashi R, Edmondson DG, et al. (1996) Tetrahymena histone acetyltransferase A: A homolog to yeast Gcn5p linking histone acetylation to gene activation. Cell 84: 843–851.
  8. Asai DJ, Forney JD (2000) Tetrahymena thermophila. San Diego: Academic Press. 580 p.
  9. Turkewitz AP, Orias E, Kapler G (2002) Functional genomics: The coming of age for Tetrahymena thermophila. Trends Genet 18: 35–40.
  10. Kim K, Weiss LM (2004) Toxoplasma gondii: The model apicomplexan. Int J Parasitol 34: 423–432.
  11. Donald RG, Roos DS (1998) Gene knock-outs and allelic replacements in Toxoplasma gondii: HXGPRT as a selectable marker for hit-and-run mutagenesis. Mol Biochem Parasitol 91: 295–305.
  12. Radke JR, Behnke MS, Mackey AJ, Radke JB, Roos DS, et al. (2005) The transcriptome of Toxoplasma gondii. BMC Biol 3: 26.
  13. Peterson DS, Gao Y, Asokan K, Gaertig J (2002) The circumsporozoite protein of Plasmodium falciparum is expressed and localized to the cell surface in the free-living ciliate Tetrahymena thermophila. Mol Biochem Parasitol 122: 119–126.
  14. Prescott DM (1994) The DNA of ciliated protozoa. Microbiol Rev 58: 233–267.
  15. Martindale DW, Allis CD, Bruns PJ (1982) Conjugation in Tetrahymena thermophila. A temporal analysis of cytological stages. Exp Cell Res 140: 227–236.
  16. Yao MC, Chao JL (2005) RNA-guided DNA deletion in Tetrahymena: An RNAi-based mechanism for programmed genome rearrangements. Annu Rev Genet 39: 537–559.
  17. Yao MC, Duharcourt S, Chalker DL (2002) Genome-wide rearrangements of DNA in ciliates. In: Craig N, Craigie R, Gellert M, Lambowitz A, editors. Mobile DNA II. Herndon (Virginia): ASM Press. pp. 730–758.
  18. Yao MC, Choi J, Yokoyama S, Austerberry CF, Yao CH (1984) DNA elimination in Tetrahymena: A developmental process involving extensive breakage and rejoining of DNA at defined sites. Cell 36: 433–440.
  19. Yao MC, Gorovsky MA (1974) Comparison of the sequences of macro- and micronuclear DNA of Tetrahymena pyriformis. Chromosoma 48: 1–18.
  20. Iwamura Y, Sakai M, Muramatsu M (1982) Rearrangement of repeated DNA sequences during development of macronucleus in Tetrahymena thermophila. Nucleic Acids Res 10: 4279–4291.
  21. Jenuwein T (2002) Molecular biology. An RNA-guided pathway for the epigenome. Science 297: 2215–2218.
  22. Selker EU (2003) Molecular biology. A self-help guide for a trim genome. Science 300: 1517–1518.
  23. Fan Q, Yao MC (2000) A long stringent sequence signal for programmed chromosome breakage in Tetrahymena thermophila. Nucleic Acids Res 28: 895–900.
  24. Hamilton EP, Williamson S, Dunn S, Merriam V, Lin C, et al. (2006) The highly conserved family of Tetrahymena thermophila chromosome breakage elements contains an invariant 10-base-pair core. Eukaryot Cell 5: 771–780.
  25. Yao MC, Yao CH, Monks B (1990) The controlling sequence for site-specific chromosome breakage in Tetrahymena. Cell 63: 763–772.
  26. Fan Q, Yao M (1996) New telomere formation coupled with site-specific chromosome breakage in Tetrahymena thermophila. Mol Cell Biol 16: 1267–1274.
  27. Yu GL, Blackburn EH (1991) Developmentally programmed healing of chromosomes by telomerase in Tetrahymena. Cell 67: 823–832.
  28. Altschuler MI, Yao MC (1985) Macronuclear DNA of Tetrahymena thermophila exists as defined subchromosomal-sized molecules. Nucleic Acids Res 13: 5817–5831.
  29. Conover RK, Brunk CF (1986) Macronuclear DNA molecules of Tetrahymena thermophila. Mol Cell Biol 6: 900–905.
  30. Kapler GM (1993) Developmentally regulated processing and replication of the Tetrahymena rDNA minichromosome. Curr Opin Genet Dev 3: 730–735.
  31. Doerder FP, Deak JC, Lief JH (1992) Rate of phenotypic assortment in Tetrahymena thermophila. Dev Genet 13: 126–132.
  32. Ray C Jr (1956) Preparation of chromosomes of Tetrahymena pyriformis for photomicrography. Stain Technol 31: 271–274.
  33. LaFountain JR Jr, Davidson LA (1979) An analysis of spindle ultrastructure during prometaphase and metaphase of micronuclear division in Tetrahymena. Chromosoma 75: 293–308.
  34. LaFountain JR Jr, Davidson LA (1980) An analysis of spindle ultrastructure during anaphase of micronuclear division in Tetrahymena. Cell Motil 1: 41–61.
  35. Mochizuki K, Gorovsky MA (2004) Small RNAs in genome rearrangement in Tetrahymena. Curr Opin Genet Dev 14: 181–187. Find this article online
  36. Orias E (2000) Toward sequencing the Tetrahymena genome: Exploiting the gift of nuclear dimorphism. J Eukaryot Microbiol 47: 328–333. Find this article online
  37. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204. Find this article online
  38. Brunk CF, Lee LC, Tran AB, Li J (2003) Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 31: 1673–1682. Find this article online
  39. Engberg J, Nielsen H (1990) Complete sequence of the extrachromosomal rDNA molecule from the ciliate Tetrahymena thermophila strain B1868VII. Nucleic Acids Res 18: 6915–6919. Find this article online
  40. Wong L, Klionsky L, Wickert S, Merriam V, Orias E, et al. (2000) Autonomously replicating macronuclear DNA pieces are the physical basis of genetic coassortment groups in Tetrahymena thermophila. Genetics 155: 1119–1125. Find this article online
  41. Cassidy-Hanley D, Bisharyan Y, Fridman V, Gerber J, Lin C, et al. (2005) Genome-wide characterization of Tetrahymena thermophila chromosome breakage sites. II. Physical and genetic mapping. Genetics 170: 1623–1631. Find this article online
  42. Yao MC, Zheng K, Yao CH (1987) A conserved nucleotide sequence at the sites of developmentally regulated chromosomal breakage in Tetrahymena. Cell 48: 779–788. Find this article online
  43. Karrer KM (2000) Tetrahymena genetics: Two nuclei are better than one. Methods Cell Biol 62: 127–186. Find this article online
  44. Cervantes MD, Xi X, Vermaak D, Yao MC, Malik HS (2006) The CNA1 histone of the ciliate Tetrahymena thermophila is essential for chromosome segregation in the germline micronucleus. Mol Biol Cell 17: 485–497. Find this article online
  45. Pryde FE, Gorham HC, Louis EJ (1997) Chromosome ends: All the same under their caps. Curr Opin Genet Dev 7: 822–828. Find this article online
  46. Wellinger RJ, Sen D (1997) The DNA structures at the ends of eukaryotic chromosomes. Eur J Cancer 33: 735–749. Find this article online
  47. Barry JD, Ginger ML, Burton P, McCulloch R (2003) Why are parasite contingency genes often associated with telomeres? Int J Parasitol 33: 29–45. Find this article online
  48. Gao W, Khang CH, Park SY, Lee YH, Kang S (2002) Evolution and organization of a highly dynamic, subtelomeric helicase gene family in the rice blast fungus Magnaporthe grisea. Genetics 162: 103–112. Find this article online
  49. Mefford HC, Trask BJ (2002) The complex structure and dynamic evolution of human subtelomeres. Nat Rev Genet 3: 91–102. Find this article online
  50. Teunissen AW, Steensma HY (1995) Review: The dominant flocculation genes of Saccharomyces cerevisiae constitute a new subtelomeric gene family. Yeast 11: 1001–1013. Find this article online
  51. Louis EJ (1995) The chromosome ends of Saccharomyces cerevisiae. Yeast 11: 1553–1573. Find this article online
  52. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654–5666. Find this article online
  53. Calzone FJ, Stathopoulos VA, Grass D, Gorovsky MA, Angerer RC (1983) Regulation of protein synthesis in Tetrahymena. RNA sequence sets of growing and starved cells. J Biol Chem 258: 6899–6905. Find this article online
  54. Zagulski M, Nowak JK, Le Mouel A, Nowacki M, Migdalski A, et al. (2004) High coding density on the largest Paramecium tetraurelia somatic chromosome. Curr Biol 14: 1397–1404. Find this article online
  55. Erdmann VA, Wolters J, Huysmans E, Vandenberghe A, De Wachter R (1984) Collection of published 5S and 5.8S ribosomal RNA sequences. Nucleic Acids Res 12(Suppl): r133–r166. Find this article online
  56. Luehrsen KR, Fox GE, Woese CR (1980) The sequence of Tetrahymena thermophila 5S ribosomal ribonucleic acid. Curr Microbiol 4: 123–126. Find this article online
  57. Kimmel AR, Gorovsky MA (1976) Numbers of 5S and tRNA genes in macro- and micronuclei of Tetrahymena pyriformis. Chromosoma 54: 327–337. Find this article online
  58. Horowitz S, Gorovsky MA (1985) An unusual genetic code in nuclear genes of Tetrahymena. Proc Natl Acad Sci U S A 82: 2452–2455. Find this article online
  59. Driscoll DM, Copeland PR (2003) Mechanism and regulation of selenoprotein synthesis. Annu Rev Nutr 23: 17–40. Find this article online
  60. Hatfield DL, Gladyshev VN (2002) How selenium has altered our understanding of the genetic code. Mol Cell Biol 22: 3565–3576. Find this article online
  61. Shrimali RK, Lobanov AV, Xu XM, Rao M, Carlson BA, et al. (2005) Selenocysteine tRNA identification in the model organisms Dictyostelium discoideum and Tetrahymena thermophila. Biochem Biophys Res Commun 329: 147–151. Find this article online
  62. Wuitschick JD, Karrer KM (1999) Analysis of genomic G + C content, codon usage, initiator codon context and translation termination sites in Tetrahymena thermophila. J Eukaryot Microbiol 46: 239–247. Find this article online
  63. Wuitschick JD, Karrer KM (2000) Codon usage in Tetrahymena thermophila. Methods Cell Biol 62: 565–568. Find this article online
  64. Eichinger L, Pachebat JA, Glockner G, Rajandream MA, Sucgang R, et al. (2005) The genome of the social amoeba Dictyostelium discoideum. Nature 435: 43–57. Find this article online
  65. Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T (2001) Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J Mol Evol 53: 290–298. Find this article online
  66. Lynch M, Conery JS (2003) The origins of genome complexity. Science 302: 1401–1404. Find this article online
  67. Katz LA, Snoeyenbos-West O, Doerder FP (2006) Patterns of protein evolution in Tetrahymena thermophila: Implications for estimates of effective population size. Mol Biol Evol 23: 608–614. Find this article online
  68. Fast NM, Xue L, Bingham S, Keeling PJ (2002) Re-examining alveolate evolution using multiple protein molecular phylogenies. J Eukaryot Microbiol 49: 30–37. Find this article online
  69. Gajadhar AA, Marquardt WC, Hall R, Gunderson J, Ariztia-Carmona EV, et al. (1991) Ribosomal RNA sequences of Sarcocystis muris, Theileria annulata and Crypthecodinium cohnii reveal evolutionary relationships among apicomplexans, dinoflagellates, and ciliates. Mol Biochem Parasitol 45: 147–154. Find this article online
  70. Gardner MJ, Williamson DH, Wilson RJ (1991) A circular DNA in malaria parasites encodes an RNA polymerase like that of prokaryotes and chloroplasts. Mol Biochem Parasitol 44: 115–123. Find this article online
  71. Cavalier-Smith T (1999) Principles of protein and lipid targeting in secondary symbiogenesis: Euglenoid, dinoflagellate, and sporozoan plastid origins and the eukaryote family tree. J Eukaryot Microbiol 46: 347–366. Find this article online
  72. Gardner MJ, Hall N, Fung E, White O, Berriman M, et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498–511. Find this article online
  73. Regoes A, Zourmpanou D, Leon-Avila G, van der Giezen M, Tovar J, et al. (2005) Protein import, replication, and inheritance of a vestigial mitochondrion. J Biol Chem 280: 30557–30563. Find this article online
  74. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815. Find this article online
  75. Dyall SD, Brown MT, Johnson PJ (2004) Ancient invasions: From endosymbionts to organelles. Science 304: 253–257. Find this article online
  76. Ralph SA, van Dooren GG, Waller RF, Crawford MJ, Fraunholz MJ, et al. (2004) Tropical infectious diseases: Metabolic maps and functions of the Plasmodium falciparum apicoplast. Nat Rev Microbiol 2: 203–216. Find this article online
  77. Erwin JA, Beach D, Holz GG Jr (1966) Effect of dietary cholesterol on unsaturated fatty acid biosynthesis in a ciliated protozoan. Biochim Biophys Acta 125: 614–616. Find this article online
  78. Holz GG Jr, Erwin J, Rosenbaum N, Aaronson S (1962) Triparanol inhibition of Tetrahymena, and its prevention by lipids. Arch Biochem Biophys 98: 312–322. Find this article online
  79. Holz GG Jr, Wagner B, Erwin J, Britt JJ, Bloch K (1961) Sterol requirements of a ciliate Tetrahymena corlissi Th-X. I. A nutritional analysis of the sterol requirements of T. corlissi Th-X. II. Metabolism of tritiated lopohenol in T. corlissi Th-X. Comp Biochem Physiol 2: 202–217. Find this article online
  80. Corliss JO (1979) The impact of electron microscopy on ciliate systematics. Am Zool 19: 573–587. Find this article online
  81. Lynn DH (1981) The organization and evolution of microtubular organelles in ciliated protozoa. Biol Rev 56: 243–292. Find this article online
  82. Abrahamsen MS, Templeton TJ, Enomoto S, Abrahante JE, Zhu G, et al. (2004) Complete genome sequence of the apicomplexan, Cryptosporidium parvum Science 304: 441–445. Find this article online
  83. Huang J, Mullapudi N, Lancto CA, Scott M, Abrahamsen MS, et al. (2004) Phylogenomic evidence supports past endosymbiosis, intracellular and horizontal gene transfer in Cryptosporidium parvum. Genome Biol 5: R88. Find this article online
  84. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422: 859–868. Find this article online
  85. Galagan JE, Selker EU (2004) RIP: The evolutionary cost of genome defense. Trends Genet 20: 417–423. Find this article online
  86. Liu Y, Song X, Gorovsky MA, Karrer KM (2005) Elimination of foreign DNA during somatic differentiation in Tetrahymena thermophila shows position effect and is dosage dependent. Eukaryot Cell 4: 421–431. Find this article online
  87. Mochizuki K, Fine NA, Fujisawa T, Gorovsky MA (2002) Analysis of a piwi-related gene implicates small RNAs in genome rearrangement in tetrahymena. Cell 110: 689–699. Find this article online
  88. Yao MC, Fuller P, Xi X (2003) Programmed DNA deletion as an RNA-guided system of genome defense. Science 300: 1581–1584. Find this article online
  89. Doerder FP, Gates MA, Eberhardt FP, Arslanyolu M (1995) High frequency of sex and equal frequencies of mating types in natural populations of the ciliate Tetrahymena thermophila. Proc Natl Acad Sci U S A 92: 8715–8718. Find this article online
  90. Fillingham JS, Thing TA, Vythilingum N, Keuroghlian A, Bruno D, et al. (2004) A nonlong terminal repeat retrotransposon family is restricted to the germ line micronucleus of the ciliated protozoan Tetrahymena thermophila. Eukaryot Cell 3: 157–169. Find this article online
  91. Wuitschick JD, Gershan JA, Lochowicz AJ, Li S, Karrer KM (2002) A novel family of mobile genetic elements is limited to the germline genome in Tetrahymena thermophila. Nucleic Acids Res 30: 2524–2537. Find this article online
  92. Pritham EJ, Feschotte C, Wessler SR (2005) Unexpected diversity and differential success of DNA transposons in four species of entamoeba protozoans. Mol Biol Evol 22: 1751–1763. Find this article online
  93. Silva JC, Bastida F, Bidwell SL, Johnson PJ, Carlton JM (2005) A potentially functional mariner transposable element in the protist Trichomonas vaginalis. Mol Biol Evol 22: 126–134. Find this article online
  94. Foss EJ, Garrett PW, Kinsey JA, Selker EU (1991) Specificity of repeat-induced point mutation (RIP) in Neurospora: Sensitivity of nonNeurospora sequences, a natural diverged tandem duplication, and unique DNA adjacent to a duplicated region. Genetics 127: 711–717. Find this article online
  95. Bowman GR, Smith DG, Michael Siu KW, Pearlman RE, Turkewitz AP (2005) Genomic and proteomic evidence for a second family of dense core granule cargo proteins in Tetrahymena thermophila. J Eukaryot Microbiol 52: 291–297. Find this article online
  96. Elde NC, Morgan G, Winey M, Sperling L, Turkewitz AP (2005) Elucidation of clathrin-mediated endocytosis in Tetrahymena reveals an evolutionarily convergent recruitment of dynamin. PLoS Genetics 1: e52. DOI: 10.1371/journal.pgen.0010052.
  97. Herrmann L, Erkelenz M, Aldag I, Tiedtke A, Hartmann MW (2006) Biochemical and molecular characterisation of Tetrahymena thermophila extracellular cysteine proteases. BMC Microbiol 6: 19. Find this article online
  98. Kuribara S, Kato M, Kato-Minoura T, Numata O (2006) Identification of a novel actin-related protein in Tetrahymena cilia. Cell Motil Cytoskeleton 63: 437–446. Find this article online
  99. Lee SR, Collins K (2006) Two classes of endogenous small RNAs in Tetrahymena thermophila. Genes Dev 20: 28–33. Find this article online
  100. Stemm-Wolf AJ, Morgan G, Giddings TH Jr, White EA, Marchione R, et al. (2005) Basal body duplication and maintenance require one member of the Tetrahymena thermophila centrin gene family. Mol Biol Cell 16: 3606–3619. Find this article online
  101. Wickstead B, Gull K (2006) A “holistic” kinesin phylogeny reveals new kinesin families and predicts protein functions. Mol Biol Cell 17: 1734–1743. Find this article online
  102. Williams SA, Gavin RH (2005) Myosin genes in Tetrahymena. Cell Motil Cytoskeleton 61: 237–243. Find this article online
  103. Wloga D, Camba A, Rogowski K, Manning G, Jerka-Dziadosz M, et al. (2006) Members of the Nima-related kinase family promote disassembly of cilia by multiple mechanisms. Mol Biol Cell 17: 2799–2810. Find this article online
  104. Global analysis of protein kinase genes in sequenced genomes Available: http://kinase.com. Accessed 15 July 2006.
  105. Manning G, Plowman GD, Hunter T, Sudarsanam S (2002) Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci 27: 514–520. Find this article online
  106. Goldberg JM, Manning G, Liu A, Fey P, Pilcher KE, et al. (2006) The dictyostelium kinome—Analysis of the protein kinases from a simple model organism. PLoS Genet 2: e38. DOI: 10.1371/journal.pgen.0020038.
  107. Christensen ST, Guerra CF, Awan A, Wheatley DN, Satir P (2003) Insulin receptor-like proteins in Tetrahymena thermophila ciliary membranes. Curr Biol 13: R50–R52. Find this article online
  108. Manning G, Caenepeel S (2005) Protein kinases in human disease. 2005–06 Catalog and technical reference Beverly (Massachusetts): Cell Signaling Technologies. pp. 402–409.
  109. O’Connell MJ, Krien MJ, Hunter T (2003) Never say never. The NIMA-related protein kinases in mitotic control. Trends Cell Biol 13: 221–228. Find this article online
  110. Okazaki N, Yan J, Yuasa S, Ueno T, Kominami E, et al. (2000) Interaction of the Unc-51-like kinase and microtubule-associated protein light chain 3 related proteins in the brain: Possible role of vesicular transport in axonal elongation. Brain Res Mol Brain Res 85: 1–12. Find this article online
  111. Wolanin PM, Thomason PA, Stock JB (2002) Histidine protein kinases: Key signal transducers outside the animal kingdom. Genome Biol 3: reviews3013. Find this article online
  112. Hanks SK (2003) Genomic analysis of the eukaryotic protein kinase superfamily: A perspective. Genome Biol 4: 111. Find this article online
  113. Ren Q, Kang KH, Paulsen IT (2004) TransportDB: A relational database of cellular membrane transport systems. Nucleic Acids Res 32: D284–D288. Find this article online
  114. Haynes WJ, Ling KY, Saimi Y, Kung C (2003) PAK paradox: Paramecium appears to have more K(+)-channel genes than humans. Eukaryot Cell 2: 737–745. Find this article online
  115. Kung C, Saimi Y (1982) The physiological basis of taxes in Paramecium. Annu Rev Physiol 44: 519–534. Find this article online
  116. Hennessey T, Machemer H, Nelson DL (1985) Injected cyclic AMP increases ciliary beat frequency in conjunction with membrane hyperpolarization. Eur J Cell Biol 36: 153–156. Find this article online
  117. Weber JH, Vishnyakov A, Hambach K, Schultz A, Schultz JE, et al. (2004) Adenylyl cyclases from Plasmodium, Paramecium and Tetrahymena are novel ion channel/enzyme fusion proteins. Cell Signal 16: 115–125. Find this article online
  118. Puente XS, Sanchez LM, Overall CM, Lopez-Otin C (2003) Human and mouse proteases: A comparative genomic approach. Nat Rev Genet 4: 544–558. Find this article online
  119. Rawlings ND, Tolle DP, Barrett AJ (2004) MEROPS: The peptidase database. Nucleic Acids Res 32: D160–D164. Find this article online
  120. Southan C (2001) A genomic perspective on human proteases. FEBS Lett 498: 214–218. Find this article online
  121. Barrett AJ, Rawlings ND, Woessner JF (1998) Handbook of proteolytic enzymes San Diego: Academic Press. 1666 p.
  122. Wu Y, Wang X, Liu X, Wang Y (2003) Data-mining approaches reveal hidden families of proteases in the genome of malaria parasite. Genome Res 13: 601–616. Find this article online
  123. Bochtler M, Ditzel L, Groll M, Hartmann C, Huber R (1999) The proteasome. Annu Rev Biophys Biomol Struct 28: 295–317. Find this article online
  124. Gruszynski AE, DeMaster A, Hooper NM, Bangs JD (2003) Surface coat remodeling during differentiation of Trypanosoma brucei. J Biol Chem 278: 24665–24672. Find this article online
  125. LaCount DJ, Gruszynski AE, Grandgenett PM, Bangs JD, Donelson JE (2003) Expression and function of the Trypanosoma brucei major surface protease (GP63) genes. J Biol Chem 278: 24658–24664. Find this article online
  126. Yao C, Donelson JE, Wilson ME (2003) The major surface protease (MSP or GP63) of Leishmania sp. Biosynthesis, regulation of expression, and function. Mol Biochem Parasitol 132: 1–16. Find this article online
  127. Madeo F, Herker E, Maldener C, Wissing S, Lachelt S, et al. (2002) A caspase-related protease regulates apoptosis in yeast. Mol Cell 9: 911–917. Find this article online
  128. Frankel J (2000) Cell biology of Tetrahymena thermophila. Methods Cell Biol 62: 27–125. Find this article online
  129. Williams NE (2000) Preparation of cytoskeletal fractions from Tetrahymena thermophila. Methods Cell Biol 62: 441–447. Find this article online
  130. Dutcher SK (2003) Long-lost relatives reappear: Identification of new members of the tubulin superfamily. Curr Opin Microbiol 6: 634–640. Find this article online
  131. Gaertig J, Thatcher TH, McGrath KE, Callahan RC, Gorovsky MA (1993) Perspectives on tubulin isotype function and evolution based on the observation that Tetrahymena thermophila microtubules contain a single alpha- and beta-tubulin. Cell Motil Cytoskeleton 25: 243–253. Find this article online
  132. McGrath KE, Yu SM, Heruth DP, Kelly AA, Gorovsky MA (1994) Regulation and evolution of the single alpha-tubulin gene of the ciliate Tetrahymena thermophila. Cell Motil Cytoskeleton 27: 272–283. Find this article online
  133. Shang Y, Li B, Gorovsky MA (2002) Tetrahymena thermophila contains a conventional gamma-tubulin that is differentially required for the maintenance of different microtubule-organizing centers. J Cell Biol 158: 1195–1206. Find this article online
  134. Dupuis-Williams P, Fleury-Aubusson A, de Loubresse NG, Geoffroy H, Vayssie L, et al. (2002) Functional role of epsilon-tubulin in the assembly of the centriolar microtubule scaffold. J Cell Biol 158: 1183–1193. Find this article online
  135. Ruiz F, Dupuis-Williams P, Klotz C, Forquignon F, Bergdoll M, et al. (2004) Genetic evidence for interaction between eta- and beta-tubulins. Eukaryot Cell 3: 212–220. Find this article online
  136. Ruiz F, Krzywicka A, Klotz C, Keller A, Cohen J, et al. (2000) The SM19 gene, required for duplication of basal bodies in Paramecium, encodes a novel tubulin, eta-tubulin. Curr Biol 10: 1451–1454. Find this article online
  137. Duan J, Gorovsky MA (2002) Both carboxy-terminal tails of alpha- and beta-tubulin are essential, but either one will suffice. Curr Biol 12: 313–316. Find this article online
  138. Thazhath R, Liu C, Gaertig J (2002) Polyglycylation domain of beta-tubulin maintains axonemal architecture and affects cytokinesis in Tetrahymena. Nat Cell Biol 4: 256–259. Find this article online
  139. Xia L, Hai B, Gao Y, Burnette D, Thazhath R, et al. (2000) Polyglycylation of tubulin is essential and affects cell motility and division in Tetrahymena thermophila. J Cell Biol 149: 1097–1106. Find this article online
  140. Gibbons IR, Rowe AJ (1965) Dynein: A protein with adenosine triphosphatase activity from cilia. Science 149: 424–426. Find this article online
  141. Gibbons IR, Lee-Eiford A, Mocz G, Phillipson CA, Tang WJ, et al. (1987) Photosensitized cleavage of dynein heavy chains. Cleavage at the “V1 site” by irradiation at 365 nm in the presence of ATP and vanadate. J Biol Chem 262: 2780–2786. Find this article online
  142. King SM (2000) The dynein microtubule motor. Biochim Biophys Acta 1496: 60–75. Find this article online
  143. Sakato M, King SM (2004) Design and regulation of the AAA+ microtubule motor dynein. J Struct Biol 146: 58–71. Find this article online
  144. Asai DJ, Koonce MP (2001) The dynein heavy chain: Structure, mechanics and evolution. Trends Cell Biol 11: 196–202. Find this article online
  145. Asai DJ, Wilkes DE (2004) The dynein heavy chain family. J Eukaryot Microbiol 51: 23–29. Find this article online
  146. Sailaja G, Lincoln LM, Chen J, Asai DJ (2001) Evaluating the dynein heavy chain gene family in Tetrahymena. Methods Mol Biol 161: 17–27. Find this article online
  147. Xu W, Royalty MP, Zimmerman JR, Angus SP, Pennock DG (1999) The dynein heavy chain gene family in Tetrahymena thermophila. J Eukaryot Microbiol 46: 606–611. Find this article online
  148. Foth BJ, Goedecke MC, Soldati D (2006) New insights into myosin evolution and classification. Proc Natl Acad Sci U S A 103: 3681–3686. Find this article online
  149. Janke C, Rogowski K, Wloga D, Regnard C, Kajava AV, et al. (2005) Tubulin polyglutamylase enzymes are members of the TTL domain protein family. Science 308: 1758–1762. Find this article online
  150. Osmani SA, Engle DB, Doonan JH, Morris NR (1988) Spindle formation and chromatin condensation in cells blocked at interphase by mutation of a negative cell cycle control gene. Cell 52: 241–251. Find this article online
  151. Fry AM, Meraldi P, Nigg EA (1998) A centrosomal function for the human Nek2 protein kinase, a member of the NIMA family of cell cycle regulators. EMBO J 17: 470–481. Find this article online
  152. Mahjoub MR, Montpetit B, Zhao L, Finst RJ, Goh B, et al. (2002) The FA2 gene of Chlamydomonas encodes a NIMA family kinase with roles in cell cycle progression and microtubule severing during deflagellation. J Cell Sci 115: 1759–1768. Find this article online
  153. Turkewitz AP (2004) Out with a bang! Tetrahymena as a model system to study secretory granule biogenesis. Traffic 5: 63–68. Find this article online
  154. Bock JB, Matern HT, Peden AA, Scheller RH (2001) A genomic perspective on membrane compartment organization. Nature 409: 839–841. Find this article online
  155. Ackers JP, Dhir V, Field MC (2005) A bioinformatic analysis of the RAB genes of Trypanosoma brucei. Mol Biochem Parasitol 141: 89–97. Find this article online
  156. Stenmark H, Olkkonen VM (2001) The Rab GTPase family. Genome Biol 2: reviews3007. Find this article online
  157. Pereira-Leal JB, Seabra MC (2001) Evolution of the Rab family of small GTP-binding proteins. J Mol Biol 313: 889–901. Find this article online
  158. Saito-Nakano Y, Loftus BJ, Hall N, Nozaki T (2005) The diversity of Rab GTPases in Entamoeba histolytica. Exp Parasitol 110: 244–252. Find this article online
  159. Lal K, Field MC, Carlton JM, Warwicker J, Hirt RP (2005) Identification of a very large Rab GTPase family in the parasitic protozoan Trichomonas vaginalis. Mol Biochem Parasitol 143: 226–235. Find this article online
  160. Muller HM, Kenny EE, Sternberg PW (2004) Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2: e309. Find this article online
  161. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, et al. (2002) The generic genome browser: A building block for a model organism system database. Genome Res 12: 1599–1610. Find this article online
  162. Dear PH, Cook PR (1993) Happy mapping: Linkage mapping using a physical analogue of meiosis. Nucleic Acids Res 21: 13–20. Find this article online
  163. Elliott AM, Gruchy DF (1952) The occurence of mating types in Tetrahymena. Biol Bull (Woods Hole, MA) 105: 301. Find this article online
  164. Mayo KA, Orias E (1981) Further evidence for lack of gene expression in the Tetrahymena micronucleus. Genetics 98: 747–762. Find this article online
  165. Allen SL, Gibson I (1973) Genetics of Tetrahymena. In: Elliott AM Biology of Tetrahymena Stroudsburg (Pennsylvania): Dowden, Hutchinson and Ross. pp. 307–373.
  166. Allen SL (1967) Genomic exclusion: A rapid means for inducing homozygous diploid lines in Tetrahymena pyriformis, syngen 1. Science 155: 575–577. Find this article online
  167. Ward N, Eisen J, Fraser C, Stackebrandt E (2001) Sequenced strains must be saved from extinction. Nature 414: 148. Find this article online
  168. Gorovsky MA, Yao MC, Keevert JB, Pleger GL (1975) Isolation of micro- and macronuclei of Tetrahymena pyriformis. Methods Cell Biol 9: 311–327. Find this article online
  169. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence of the human genome. Science 291: 1304–1351. Find this article online
  170. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, et al. (2004) Versatile and open software for comparing large genomes. Genome Biol 5: R12. Find this article online
  171. Lynch TJ, Brickner J, Nakano KJ, Orias E (1995) Genetic map of randomly amplified DNA polymorphisms closely linked to the mating type locus of Tetrahymena thermophila. Genetics 141: 1315–1325. Find this article online
  172. Hamilton E, Bruns P, Lin C, Merriam V, Orias E, et al. (2005) Genome-wide characterization of Tetrahymena thermophila chromosome breakage sites. I. Cloning and identification of functional sites. Genetics 170: 1611–1621. Find this article online
  173. Birren B, Lai E (1993) Pulsed field gel electrophoresis—A practical guide New York: Academic Press.
  174. Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Res 22: 2079–2088. Find this article online
  175. Lowe TM, Eddy SR (1997) tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964. Find this article online
  176. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: An RNA family database. Nucleic Acids Res 31: 439–441. Find this article online
  177. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, et al. (2005) Rfam: Annotating noncoding RNAs in complete genomes. Nucleic Acids Res 33: D121–D124. Find this article online
  178. Weinberg Z, Ruzzo WL (2004) Exploiting conserved structure for faster annotation of noncoding RNAs without loss of accuracy. Bioinformatics 20((Suppl 1)): I334–I341. Find this article online
  179. Orum H, Nielsen H, Engberg J (1993) Sequence and proposed secondary structure of the Tetrahymena thermophila U3-snRNA. Nucleic Acids Res 21: 2511. Find this article online
  180. Weinberg Z, Ruzzo WL (2006) Sequence-based heuristics for faster annotation of noncoding RNA families. Bioinformatics 22: 35–39. Find this article online
  181. Majoros WH, Pertea M, Salzberg SL (2004) TigrScan and GlimmerHMM: Two open source ab initio eukaryotic gene-finders. Bioinformatics 20: 2878–2879. Find this article online
  182. Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5: 59. Find this article online
  183. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, et al. (2003) TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31: 374–378. Find this article online
  184. Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H (1999) Interpolated Markov models for eukaryotic gene finding. Genomics 59: 24–31. Find this article online
  185. Edgar RC (2004) MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113. Find this article online
  186. Howe K, Bateman A, Durbin R (2002) QuickTree: Building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18: 1546–1547. Find this article online
  187. Scharfe C, Zaccaria P, Hoertnagel K, Jaksch M, Klopstock T, et al. (2000) MITOP, the mitochondrial proteome database: 2000 Update. Nucleic Acids Res 28: 155–158. Find this article online
  188. Scharfe C, Zaccaria P, Hoertnagel K, Jaksch M, Klopstock T, et al. (1999) MITOP: Database for mitochondria-related proteins, genes and diseases. Nucleic Acids Res 27: 153–155. Find this article online
  189. Benson G (1999) Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res 27: 573–580. Find this article online
  190. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402. Find this article online
  191. Gershan JA, Karrer KM (2000) A family of developmentally excised DNA elements in Tetrahymena is under selective pressure to maintain an open reading frame encoding an integrase-like protein. Nucleic Acids Res 28: 4105–4112. Find this article online
  192. Shao H, Tu Z (2001) Expanding the diversity of the IS630-Tc1-mariner superfamily: discovery of a unique DD37E transposon and reclassification of the DD37D and DD39D transposons. Genetics 159: 1103–1115. Find this article online
  193. Ogurtsov AY, Roytberg MA, Shabalina SA, Kondrashov AS (2002) OWEN: Aligning long collinear regions of genomes. Bioinformatics 18: 1703–1704. Find this article online
  194. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26: 320–322. Find this article online
  195. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol 305: 567–580. Find this article online
  196. Ren Q, Paulsen IT (2005) Comparative analyses of fundamental differences in membrane transport capabilities in prokaryotes and eukaryotes. PLoS Comput Biol 1: e27. DOI: 10.1371/journal.pcbi.0010027.
  197. Baldauf SL (2003) The deep roots of eukaryotes. Science 300: 1703–1706. Find this article online

The power of open access III: posting another of my open access publications here (Wolbachia genome paper)

I am posting another of my Open Access papers here – this was one of my first OA papers – a paper reporting sequencing and analysis of the first genome of a Wolbachia strain. Citation is:

Wu M, Sun LV, Vamathevan J, Riegler M, Deboy R, et al. (2004) Phylogenomics of the Reproductive Parasite Wolbachia pipientis wMel: A Streamlined Genome Overrun by Mobile Genetic Elements. PLoS Biol 2(3): e69 doi:10.1371/journal.pbio.0020069

Phylogenomics of the Reproductive Parasite Wolbachia pipientis wMel: A Streamlined Genome Overrun by Mobile Genetic Elements

Martin Wu1, Ling V. Sun2, Jessica Vamathevan1, Markus Riegler3, Robert Deboy1, Jeremy C. Brownlie3, Elizabeth A. McGraw3, William Martin4, Christian Esser4, Nahal Ahmadinejad4, Christian Wiegand4, Ramana Madupu1, Maureen J. Beanan1, Lauren M. Brinkac1, Sean C. Daugherty1, A. Scott Durkin1, James F. Kolonay1, William C. Nelson1, Yasmin Mohamoud1, Perris Lee1, Kristi Berry1, M. Brook Young1, Teresa Utterback1, Janice Weidman1, William C. Nierman1, Ian T. Paulsen1, Karen E. Nelson1, Hervé Tettelin1, Scott L. O’Neill2,3, Jonathan A. Eisen1*

1 The Institute for Genomic Research, Rockville, Maryland, United States of America, 2 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Connecticut, United States of America, 3 Department of Zoology and Entomology, School of Life Sciences, The University of Queensland, St Lucia, Queensland, Australia, 4 Institut für Botanik III, Heinrich-Heine Universität, Düsseldorf, Germany

The complete sequence of the 1,267,782 bp genome of Wolbachia pipientis wMel, an obligate intracellular bacterium of Drosophila melanogaster, has been determined. Wolbachia, which are found in a variety of invertebrate species, are of great interest due to their diverse interactions with different hosts, which range from many forms of reproductive parasitism to mutualistic symbioses. Analysis of the wMel genome, in particular phylogenomic comparisons with other intracellular bacteria, has revealed many insights into the biology and evolution of wMel and Wolbachia in general. For example, the wMel genome is unique among sequenced obligate intracellular species in both being highly streamlined and containing very high levels of repetitive DNA and mobile DNA elements. This observation, coupled with multiple evolutionary reconstructions, suggests that natural selection is somewhat inefficient in wMel, most likely owing to the occurrence of repeated population bottlenecks. Genome analysis predicts many metabolic differences with the closely related Rickettsia species, including the presence of intact glycolysis and purine synthesis, which may compensate for an inability to obtain ATP directly from its host, as Rickettsia can. Other discoveries include the apparent inability of wMel to synthesize lipopolysaccharide and the presence of the most genes encoding proteins with ankyrin repeat domains of any prokaryotic genome yet sequenced. Despite the ability of wMel to infect the germline of its host, we find no evidence for either recent lateral gene transfer between wMel and D. melanogaster or older transfers between Wolbachia and any host. Evolutionary analysis further supports the hypothesis that mitochondria share a common ancestor with the α-Proteobacteria, but shows little support for the grouping of mitochondria with species in the order Rickettsiales. With the availability of the complete genomes of both species and excellent genetic tools for the host, the wMel–D. melanogaster symbiosis is now an ideal system for studying the biology and evolution of Wolbachia infections.
Academic Editor: Nancy A. Moran, University of Arizona
Citation: Wu M, Sun LV, Vamathevan J, Riegler M, Deboy R, et al. (2004) Phylogenomics of the Reproductive Parasite Wolbachia pipientis wMel: A Streamlined Genome Overrun by Mobile Genetic Elements. PLoS Biol 2(3): e69 doi:10.1371/journal.pbio.0020069
Received: November 19, 2003; Accepted: January 6, 2004; Published: March 16, 2004
Copyright: © 2004 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviations: CDS, coding sequence; ENc, effective number of codons; IS, insertion sequence; LPS, lipopolysaccharide; RT, reverse transcription; TIGR, The Institute for Genomic Research
* To whom correspondence should be addressed. E-mail: jeisen@tigr.org

Introduction

Wolbachia are intracellular gram-negative bacteria that are found in association with a variety of invertebrate species, including insects, mites, spiders, terrestrial crustaceans, and nematodes. Wolbachia are transovarially transmitted from females to their offspring and are extremely widespread, having been found to infect 20%–75% of invertebrate species sampled (Jeyaprakash and Hoy 2000; Werren and Windsor 2000). Wolbachia are members of the Rickettsiales order of the α-subdivision of the Proteobacteria phylum and belong to the Anaplasmataceae family, with members of the genera Anaplasma, Ehrlichia, Cowdria, and Neorickettsia (Dumler et al. 2001). Six major clades (A–F) of Wolbachia have been identified to date (Lo et al. 2002): A, B, E, and F have been reported from insects, arachnids, and crustaceans; C and D from filarial nematodes.


Figure 1. Circular Map of the Genome and Genome Features

Circles correspond to the following: (1) forward strand genes; (2) reverse strand genes, (3) in red, genes with likely orthologs in both R. conorii and R. prowazekii; in blue, genes with likely orthologs in R. prowazekii, but absent from R. conorii; in green, genes with likely orthologs in R. conorii but absent from R. prowazekii; in yellow, genes without orthologs in either Rickettsia (Table S3); (4) plot is of χ2 analysis of nucleotide composition; phage regions are in pink; (5) plot of GC skew (G–C)/(G+C); (6) repeats over 200 bp in length, colored by category; (7) in green, transfer RNAs; (8) in blue, ribosomal RNAs; in red, structural RNA.

Wolbachia–host interactions are complex and range from mutualistic to pathogenic, depending on the combination of host and Wolbachia involved. Most striking are the various forms of “reproductive parasitism” that serve to alter host reproduction in order to enhance the transmission of this maternally inherited agent. These include parthenogenesis (infected females reproducing in the absence of mating to produce infected female offspring), feminization (infected males being converted into functional phenotypic females), male-killing (infected male embryos being selectively killed), and cytoplasmic incompatibility (in its simplest form, the developmental arrest of offspring of uninfected females when mated to infected males) (O’Neill et al. 1997a).
Wolbachia have been hypothesized to play a role in host speciation through the reproductive isolation they generate in infected hosts (Werren 1998). They also provide an intriguing array of evolutionary solutions to the genetic conflict that arises from their uniparental inheritance. These solutions represent alternatives to classical mutualism and are often of more benefit to the symbiont than the host that is infected (Werren and O’Neill 1997). From an applied perspective, it has been proposed that Wolbachia could be utilized to either suppress pest insect populations or sweep desirable traits into pest populations (e.g., the inability to transmit disease-causing pathogens) (Sinkins and O’Neill 2000). Moreover, they may provide a new approach to the control of human and animal filariasis. Since the nematode worms that cause filariasis have an obligate symbiosis with mutualistic Wolbachia, treatment of filariasis with simple antibiotics that target Wolbachia has been shown to eliminate microfilaria production as well as ultimately killing the adult worm (Taylor et al. 2000; Taylor and Hoerauf 2001).
Despite their common occurrence and major effects on host biology, little is currently known about the molecular mechanisms that mediate the interactions between Wolbachia and their invertebrate hosts. This is partly due to the difficulty of working with an obligate intracellular organism that is difficult to culture and hard to obtain in quantity. Here we report the completion and analysis of the genome sequence of Wolbachia pipientis wMel, a strain from the A supergroup that naturally infects Drosophila melanogaster (Zhou et al. 1998).


Table 1. wMel Genome Features

Results/Discussion

Genome Properties

The wMel genome was determined to be a single circular molecule of 1,267,782 bp with a G+C content of 35.2%. This assembly is very similar to the genetic and physical map of the closely related strain wMelPop (Sun et al. 2003). The genome does not exhibit the GC skew pattern typical of some prokaryotic genomes (Figure 1) that have two major shifts, one near the origin and one near the terminus of replication. Therefore, identification of a putative origin of replication and the assignment of basepair 1 were based on the location of the dnaA gene. Major features of the genome and of the annotation are summarized in Table 1 and Figure 1.
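The G+C content and the GC skew (G–C)/(G+C) plotted in Figure 1 are simple to compute from a sequence. The following is only a minimal sketch in Python, not the code used for the paper; the function names, the window size, and the step are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """GC skew as defined in the Figure 1 legend: (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window, step):
    """Return (start, skew) pairs for windows along a linearized sequence.

    The window and step sizes are free parameters chosen by the caller;
    the values used for Figure 1 are not stated in this section.
    """
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

In a genome with the typical pattern, the windowed skew would change sign at two positions, near the origin and near the terminus of replication. The absence of that pattern in wMel is why the dnaA gene was used to assign basepair 1.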

Repetitive and Mobile DNA

The most striking feature of the wMel genome is the presence of very large amounts of repetitive DNA and DNA corresponding to mobile genetic elements, which is unique for an intracellular species. In total, 714 repeats of greater than 50 bp in length, which can be divided into 158 distinct families (Table S1), were identified. Most of the repeats are present in only two copies in the genome, although 39 are present in three or more copies, with the most abundant repeat being found in 89 copies. We focused our analysis on the 138 repeats of greater than 200 bp (Table 2). These were divided into 19 families based upon sequence similarity to each other. These repeats were found to make up 14.2% of the wMel genome. Of these repeat families, 15 correspond to likely mobile elements, including seven types of insertion sequence (IS) elements, four likely retrotransposons, and four families without detectable similarity to known elements but with many hallmarks of mobile elements (flanked by inverted repeats, present in multiple copies) (Table 2). One of these new elements (repeat family 8) is present in 45 copies in the genome. It is likely that many of these elements are not able to autonomously transpose since many of the transposase genes are apparently inactivated by mutations or the insertion of other transposons (Table S2). However, some are apparently recently active since there are transposons inserted into at least nine genes (Table S2), and the copy number of some repeats appears to be variable between Wolbachia strains (M. Riegler et al., personal communication). Thus, many of these repetitive elements may be useful markers for strain discrimination. In addition, the mobile elements likely contribute to generating the diversity of phenotypically distinct Wolbachia strains (e.g., mod strains [McGraw et al. 2001]) by altering or disrupting gene function (Table S2).


Table 2. wMel DNA Repeats of Greater than 200 bp

Three prophage elements are present in the genome. One is a small pyocin-like element made up of nine genes (WD00565–WD00575). The other two are closely related to and exhibit extensive gene order conservation with the WO phage described from Wolbachia sp. wKue (Masui et al. 2001) (Figure 2). Thus, we have named them wMel WO-A and WO-B, based upon their location in the genome. wMel WO-B has undergone a major rearrangement and translocation, suggesting it is inactive. Phylogenetic analysis indicates that wMel WO-B is more closely related to the wKue WO than to wMel WO-A (Figure S1). Thus, wMel WO-A likely represents either a separate insertion event in the Wolbachia lineage or a duplication that occurred prior to the separation of the wMel and wKue lineages. Phylogenetic analysis also confirms the proposed mosaic nature of the WO phage (Masui et al. 2001), with one block being closely related to lambdoid phage and another to P2 phage (data not shown).

Genome Structure: Rearrangements, Duplications, and Deletions

The irregular pattern of GC skew in wMel is likely due in part to intragenomic rearrangements associated with the many DNA repeat elements. Comparison with a large contig from a Wolbachia species that infects Brugia malayi is consistent with this (Ware et al. 2002) (Figure 3). While only translocations are seen in this plot, genetic comparisons reveal that inversions also occur between strains (Sun et al. 2003), which is consistent with previous studies of prokaryotic genomes that have found that the most common large-scale rearrangements are inversions that are symmetric around the origin of DNA replication (Eisen et al. 2000). The occurrence of frequent rearrangement events during Wolbachia evolution is supported by the absence of any large-scale conserved gene order with Rickettsia genomes. The rearrangements in Wolbachia likely correspond with the introduction and massive expansion of the repeat element families that could serve as sites for intragenomic recombination, as has been shown to occur for some other bacterial species (Parkhill et al. 2003). The rearrangements in wMel may have fitness consequences since several classes of genes often found in clusters are generally scattered throughout the wMel genome (e.g., ABC transporter subunits, Sec secretion genes, rRNA genes, F-type ATPase genes).


Figure 2. Phage Alignments and Neighboring Genes

Conserved gene order between the WO phage in Wolbachia sp. wKue and prophage regions of wMel. Putative proteins in wKue (Masui et al. 2001) were searched using TBLASTN against the wMel genome. Matches with an E-value of less than 1e−15 are linked by connecting lines. CDSs are colored as follows: brown, phage structural or replication genes; light blue, conserved hypotheticals; red, hypotheticals; magenta, transposases or reverse transcriptases; blue, ankyrin repeat genes; light gray, radC; light green, paralogous genes; gold, others. The regions surrounding the phage are shown because they have some unusual features relative to the rest of the genome. For example, WO-A and WO-B are each flanked on one side by clusters of genes in two paralogous families that are distantly related to phage repressors. In each of these clusters, a homolog of the radC gene is found. A third radC homolog (WD1093) in the genome is also flanked by a member of one of these gene families (WD1095). While the connection between radC and the phage is unclear, the multiple copies of the radC gene and the members of these paralogous families may have contributed to the phage rearrangements described above.
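The linking rule in the Figure 2 legend (keep TBLASTN matches with an E-value below 1e−15) amounts to a simple filter over search results. The sketch below assumes the hits are available in the standard 12-column tab-delimited BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). That file format, the function name, and the identifiers in the usage example are assumptions made for illustration, not details taken from the paper.

```python
def significant_hits(lines, max_evalue=1e-15):
    """Keep tabular BLAST hits whose E-value is below max_evalue.

    Each returned tuple is (query, subject, start, end, evalue), with the
    subject coordinates ordered so that start <= end regardless of strand.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        s_start, s_end = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue < max_evalue:
            hits.append(
                (query, subject, min(s_start, s_end), max(s_start, s_end), evalue)
            )
    return hits


# Hypothetical example rows; the identifiers and numbers are made up.
example = [
    "phage_gp1\twMel\t85.0\t200\t30\t0\t1\t200\t5000\t4401\t1e-30\t150",
    "phage_gp2\twMel\t40.0\t90\t54\t2\t1\t90\t7000\t7270\t2e-10\t60",
]
for hit in significant_hits(example):
    print(hit)
```

Only the first hypothetical row passes the threshold, so it is the only one that would be drawn as a connecting line.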

Although the common ancestor of Wolbachia and Rickettsia likely already had a reduced, streamlined genome, wMel has lost additional genes since that time (Table S3). Many of these recent losses are of genes involved in cell envelope biogenesis in other species, including most of the machinery for producing lipopolysaccharide (LPS) components and the alanine racemase that supplies D-alanine for cell wall synthesis. In addition, some other genes that may have once been involved in this process are present in the genome, but defective (e.g., mannose-1-phosphate guanylyltransferase, which is split into two coding sequences [CDSs], WD1224 and WD1227, by an IS5 element) and are likely in the process of being eliminated. The loss of cell envelope biogenesis genes has also occurred during the evolution of the Buchnera endosymbionts of aphids (Shigenobu et al. 2000; Moran and Mira 2001). Thus, wMel and Buchnera have lost some of the same genes separately during their reductive evolution. Such convergence means that attempts to use gene content to infer evolutionary relatedness need to be interpreted with caution. In addition, since Anaplasma and Ehrlichia also apparently lack genes for LPS production (Lin and Rikihisa 2003), it is likely that the common ancestor of Wolbachia, Ehrlichia, and Anaplasma was unable to synthesize LPS. Thus, the reports that Wolbachia-derived LPS-like compounds are involved in the immunopathology of filarial nematode disease in mammals (Taylor 2002) either indicate that these Wolbachia have acquired genes for LPS synthesis or that the reported LPS-like compounds are not homologous to LPS.


Figure 3. Alignment of wMel with a 60 kbp Region of the Wolbachia from B. malayi

The figure shows BLASTN matches (green) and whole-proteome alignments (red) that were generated using the “promer” option of the MUMmer software (Delcher et al. 1999). The B. malayi region is from a BAC clone (Ware et al. 2002). Note the regions of alignment broken up by many rearrangements and the presence of repetitive sequences at the regions of the breaks.

Despite evident genome reduction in wMel and in contrast to most small-genomed intracellular species, gene duplication appears to have continued, as over 50 gene families have apparently expanded in the wMel lineage relative to that of all other species (Table S4). Many of the pairs of duplicated genes are encoded next to each other in the genome, suggesting that they arose by tandem duplication events and may simply reflect transient duplications in evolution (deletion is common when there are tandem arrays of genes). Many others are components of mobile genetic elements, indicating that these elements have expanded significantly after entering the Wolbachia evolutionary lineage. Other duplications that could contribute to the unique biological properties of wMel include that of the mismatch repair gene mutL (see below) and that of many hypothetical and conserved hypothetical proteins.
One duplication of particular interest is that of wsp, which is a standard gene for strain identification and phylogenetic reconstruction in Wolbachia (Zhou et al. 1998). In addition to the previously described wsp (WD0159), wMel encodes two wsp paralogs (WD0009 and WD0489), which we designate as wspB and wspC, respectively. While these paralogs are highly divergent from wsp (protein identities of 19.7% and 23.5%, respectively) and do not amplify using the standard wsp PCR primers (Braig et al. 1998; Zhou et al. 1998), their presence could lead to some confusion in classification and identification of Wolbachia strains. This has apparently occurred in one study of Wolbachia strain wKueYO, for which the reported wsp gene (gbAB045235) is actually an ortholog of wspB (99.8% sequence identity and located at the end of the virB operon [Masui et al. 2000]) and not an ortholog of the wsp gene. Considering that the wsp gene has been extremely informative for discriminating between strains of Wolbachia, we designed PCR primers to the wMel wspB gene to amplify and then sequence the orthologs from the related wRi and wAlbB Wolbachia strains from Drosophila simulans and Aedes albopictus, respectively, as well as the Wolbachia strain that infects the filarial nematode Dirofilaria immitis to determine the potential utility of this locus for strain discrimination. A comparison of genetic distances between the wsp and wspB genes for these different taxa indicates that overall the wspB gene appears to be evolving at a faster rate than wsp and, as such, may be a useful additional marker for discriminating between closely related Wolbachia strains (Table S5).
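Identity figures such as the 19.7% and 23.5% quoted above depend on how aligned positions are counted. As a rough illustration only, the Python sketch below computes percent identity over the columns of a pairwise alignment in which neither sequence has a gap. That denominator is an assumption; the convention actually used for the wsp comparisons is not stated in this section, and the sequences in the usage example are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Columns where either sequence has a gap are ignored, so the
    denominator is the number of ungapped columns (one of several
    conventions in common use).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: four of the five ungapped columns are identical.
print(percent_identity("MKT-LV", "MRTALV"))
```

Counting gapped columns as mismatches instead would lower the value, which is one reason identity percentages from different studies are not always directly comparable.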

Inefficiency of Selection in wMel

The fraction of the genome that is repetitive DNA and the fraction that corresponds to mobile genetic elements are among the highest for any prokaryotic genome. This is particularly striking compared to the genomes of other obligate intracellular species such as Buchnera, Rickettsia, Chlamydia, and Wigglesworthia, which all have very low levels of repetitive DNA and mobile elements. The recently sequenced genome of the intracellular pathogen Coxiella burnetii (Seshadri et al. 2003) has both a streamlined genome and moderate amounts of repetitive DNA, although much less than wMel. The paucity of repetitive DNA in these and other intracellular species is thought to be due to a combination of lack of exposure to other species, thereby limiting introduction of mobile elements, and genome streamlining (Mira et al. 2001; Moran and Mira 2001; Frank et al. 2002). We examined the wMel genome to try to understand the origin of the repetitive and mobile DNA and to explain why such repetitive/mobile DNA is present in wMel, but not other streamlined intracellular species.
We propose that the mobile DNA in wMel was acquired some time after the separation of the Wolbachia and Rickettsia lineages but before the radiation of the Wolbachia group. The acquisition of these elements after the separation of the Wolbachia and Rickettsia lineages is suggested by the fact that most do not have any obvious homologous sequences in the genomes of other α-Proteobacteria, including the closely related Rickettsia spp. Additional evidence for some acquisition of foreign DNA after the Wolbachia–Rickettsia split comes from phylogenetic analysis of those genes present in wMel, but not in the two sequenced rickettsial genomes (see Table S3; unpublished data). The acquisition prior to the radiation of Wolbachia is suggested by two lines of evidence. First, many of the elements are found in the genome of the distantly related Wolbachia of the nematode B. malayi (see Figure 3; unpublished data). In addition, genome analysis reveals that these elements do not have significantly anomalous nucleotide composition or codon usage compared to the rest of the genome. In fact, there are only four regions of the genome with significantly anomalous composition, comprising in total only approximately 17 kbp of DNA (Table 3). The lack of anomalous composition suggests either that any foreign DNA in wMel was acquired long enough ago to allow it to “ameliorate” and become compositionally similar to endogenous Wolbachia DNA (Lawrence and Ochman 1997, 1998) or that any foreign DNA that is present was acquired from organisms with similar composition to endogenous wMel genes. Owing to their potential effects on genome evolution (insertional mutagenesis, catalyzing genome rearrangements), we propose that the acquisition and maintenance of these repetitive and mobile elements by wMel have played a key role in shaping the evolution of Wolbachia.
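To make the idea of “anomalous nucleotide composition” concrete, one common approach is to compare the base counts in a window with the counts expected from genome-wide base frequencies, using a χ2 statistic. The Python sketch below shows that general idea under stated assumptions: single-nucleotide counts, a window supplied by the caller, and illustrative function names. It is not necessarily the exact procedure behind Figure 1 or Table 3, which is not described in this section.

```python
from collections import Counter


def base_frequencies(seq):
    """Genome-wide frequencies of A, C, G and T (ambiguous bases ignored)."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in "ACGT"}


def chi_square(window, genome_freqs):
    """Chi-square statistic of a window's base counts against the counts
    expected from the genome-wide frequencies: sum of (O - E)^2 / E."""
    counts = Counter(b for b in window.upper() if b in "ACGT")
    n = sum(counts.values())
    score = 0.0
    for base in "ACGT":
        expected = genome_freqs[base] * n
        if expected > 0:
            score += (counts[base] - expected) ** 2 / expected
    return score


# Toy example: a balanced "genome" and a window made only of A.
freqs = base_frequencies("ACGT" * 100)
print(chi_square("AAAAAAAA", freqs))
```

Windows with high scores are candidates for recently acquired DNA, while DNA that has had time to ameliorate would score close to the genome background.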


Table 3. Regions of Anomalous Nucleotide Composition in the wMel Genome

It is likely that much of the mobile/repetitive DNA was introduced via phage, given that three prophage elements are present; experimental studies have shown active phage in some Wolbachia (Masui et al. 2001) and Wolbachia superinfections occur in many hosts (e.g., Jamnongluk et al. 2002), which would allow phage to move between strains. Whatever the mechanism of introduction, the persistence of the repetitive elements in wMel in the face of apparently strong pressures for streamlining is intriguing. One explanation is that wMel may be getting a steady infusion of mobile elements from other Wolbachia strains to counteract the elimination of elements by selection for genome streamlining. This would explain the absence of anomalous nucleotide composition of the elements. However, we believe that a major contributing factor to the presence of all the repetitive/mobile DNA in wMel is that wMel, and possibly Wolbachia in general, experience a general inefficiency of natural selection relative to other species. This inefficiency would limit the ability to eliminate repetitive DNA. A general inefficiency of natural selection (especially purifying selection) has been suggested previously for intracellular bacteria, based in part on observations that these bacteria have higher evolutionary rates than free-living bacteria (e.g., Moran 1996). We also find a higher evolutionary rate for wMel than that of the closely related intracellular Rickettsia, which themselves have higher rates than free-living α-Proteobacteria (Figure 4). Additionally, codon bias in wMel appears to be driven more by mutation or drift than selection (Figure S2), as has been reported for Buchnera species and was suggested to be due to inefficient purifying selection (Wernegreen and Moran 1999). Such inefficiencies of natural selection are generally due to an increase in the relative contribution of genetic drift and mutation as compared to natural selection (Eiglmeier et al. 2001; Lawrence 2001; Parkhill et al. 2001). Below we discuss different possible explanations for the inefficiency of selection in wMel, especially in comparison to other intracellular bacteria.


Figure 4. Long Evolutionary Branches in wMel

Maximum-likelihood phylogenetic tree constructed on concatenated protein sequences of 285 orthologs shared among wMel, R. prowazekii, R. conorii, C. crescentus, and E. coli. The location of the most recent common ancestor of the α-Proteobacteria (Caulobacter, Rickettsia, Wolbachia) is defined by the outgroup E. coli. The unit of branch length is the number of changes per amino acid. Overall, the amino acid substitution rate in the wMel lineage is about 63% higher than that of C. crescentus, a free-living α-proteobacterium. wMel has evolved at a slightly higher rate than the Rickettsia spp., close relatives that are also obligate intracellular bacteria that have undergone accelerated evolution themselves. This higher rate is likely due in part to an increase in the rate of slightly deleterious mutations, although we have not ruled out the possibility of G+C content effects on the branch lengths.

Low rates of recombination, such as occur in centromeres and the human Y chromosome, can lead to inefficient selection because of the linkage among genes. This has been suggested to be occurring in Buchnera species because these species do not encode homologs of RecA, which is the key protein in homologous recombination in most species (Shigenobu et al. 2000). The absence of recombination in Buchnera is supported by the lack of genome rearrangements in their recent evolution (Tamas et al. 2002). Additionally, there is apparently little or no gene flow into Buchnera strains. In contrast, wMel encodes the necessary machinery for recombination, including RecA (Table S6), and has experienced both extensive intragenomic homologous recombination and introduction of foreign DNA. Therefore, the unusual genome features of wMel are unlikely to be due to low levels of recombination.
Another possible explanation for inefficient selection is high mutation rates. It has been suggested that the higher evolutionary rates in intracellular bacteria are the result of high mutation rates that are in turn due to the loss of genes for DNA repair processes (e.g., Itoh et al. 2002). This is likely not the case in wMel since its genome encodes proteins corresponding to a broad suite of DNA repair pathways including mismatch repair, nucleotide excision repair, base excision repair, and homologous recombination (Table S6). The only noteworthy DNA repair gene absent from wMel and present in the more slowly evolving Rickettsia is mfd, which is involved in targeting DNA repair to the transcribed strand of actively transcribing genes in other species (Selby et al. 1991). However, this absence is unlikely to contribute significantly to the increased evolutionary rate in wMel, since defects in mfd do not lead to large increases in mutation rates in other species (Witkin 1994). The presence of mismatch repair genes (homologs of mutS and mutL) in wMel is particularly relevant since this pathway is one of the key steps in regulating mutation rates in other species. In fact, wMel is the first bacterial species to be found with two mutL homologs. Overall, examination of the predicted DNA repair capabilities of bacteria (Eisen and Hanawalt 1999) suggests that the connection between evolutionary rates in intracellular species and the loss of DNA repair processes is spurious. While many intracellular species have lost DNA repair genes in their recent evolution, different species have lost different genes and some, such as wMel and Buchnera spp., have kept the genes that likely regulate mutation rates. In addition, some free-living species without high evolutionary rates have lost some of the same pathways lost in intracellular species, while many free-living species have lost key pathways resulting in high mutation rates (e.g., Helicobacter pylori has apparently lost mismatch repair [Eisen 1997, 1998b; Bjorkholm et al. 2001]). Given that intracellular species tend to have small genomes and have lost genes from every type of biological process, it is not surprising that many of them have lost DNA repair genes as well.
We believe that the most likely explanations for the inefficiency of selection in wMel involve population-size related factors, such as genetic drift and the occurrence of population bottlenecks. Such factors have also been shown to likely explain the high evolutionary rates in other intracellular species (Moran 1996; Moran and Mira 2001; van Ham et al. 2003). Wolbachia likely experience frequent population bottlenecks both during transovarial transmission (Boyle et al. 1993) and during cytoplasmic incompatibility mediated sweeps through host populations. The extent of these bottlenecks may be greater than in other intracellular bacteria, which would explain why wMel has both more repetitive and mobile DNA than other such species and a higher evolutionary rate than even the related Rickettsia spp. Additional genome sequences from other Wolbachia will reveal whether this is a feature of all Wolbachia or only certain strains.

Mitochondrial Evolution

There is a general consensus in the evolutionary biology literature that the mitochondria evolved from bacteria in the α-subgroup of the Proteobacteria phylum (e.g., Lang et al. 1999). Analysis of complete mitochondrial and bacterial genomes has very strongly supported this hypothesis (Andersson et al. 1998, 2003; Muller and Martin 1999; Ogata et al. 2001). However, the exact position of the mitochondria within the α-Proteobacteria is still debated. Many studies have placed them in or near the Rickettsiales order (Viale and Arakaki 1994; Gupta 1995; Sicheritz-Ponten et al. 1998; Lang et al. 1999; Bazinet and Rollins 2003). Some studies have further suggested that mitochondria are a sister taxon to the Rickettsia genus within the Rickettsiaceae family and thus more closely related to Rickettsia spp. than to species in the Anaplasmataceae family such as Wolbachia (Karlin and Brocchieri 2000; Emelyanov 2001a, 2001b, 2003a, 2003b).
In our analysis of complete genomes, including that of wMel, the first non-Rickettsia member of the Rickettsiales order to have its genome completed, we find support for a grouping of Wolbachia and Rickettsia to the exclusion of the mitochondria, but not for placing the mitochondria within the Rickettsiales order (Figure 5A and 5B; Table S7; Table S8). Specifically, phylogenetic trees of a concatenated alignment of 32 proteins show strong support with all methods (see Table S7) for common branching of: (i) mitochondria, (ii) Rickettsia with Wolbachia, (iii) the free-living α-Proteobacteria, and (iv) mitochondria within α-Proteobacteria. Since amino acid content bias was very severe in these datasets, protein LogDet analyses, which can correct for the bias, were also performed. In LogDet analyses of the concatenated protein alignment, both including and excluding highly biased positions, mitochondria usually branched basal to the Wolbachia–Rickettsia clade, but never specifically with Rickettsia (see Table S7). In addition, in phylogenetic studies of individual genes, there was no consistent phylogenetic position of mitochondrial proteins with any particular species or group within the α-Proteobacteria (see Table S8), although support for a specific branch uniting the two Rickettsia species with Wolbachia was quite strong. Eight of the proteins from mitochondrial genomes (YejW, SecY, Rps8, Rps2, Rps10, RpoA, Rpl15, Rpl32) do not even branch within the α-Proteobacteria, although these genes almost certainly were encoded in the ancestral mitochondrial genome (Lang et al. 1997).
This analysis of mitochondrial and α-Proteobacterial genes reinforces the view that ancient protein phylogenies are inherently prone to error, most likely because current models of phylogenetic inference do not accurately reflect the true evolutionary processes underlying the differences observed in contemporary amino acid sequences (Penny et al. 2001). These conflicting results regarding the precise position of mitochondria within the α-Proteobacteria can be seen in the high amount of networking in the Neighbor-Net graph of the analyses of the concatenated alignment shown in Figure 5. An important complication in studies of mitochondrial evolution lies in identifying “α-Proteobacterial” genes for comparison (Martin 1999). For example, in our analyses, proteins from Magnetococcus branched with other α-Proteobacterial homologs in only 17 of the 49 proteins studied, and in five cases they assumed a position basal to α-, β-, and γ-Proteobacterial homologs.

Host–Symbiont Gene Transfers

Many genes that were once encoded in mitochondrial genomes have been transferred into the host nuclear genomes. Searching for such genes has been complicated by the fact that many of the transfer events happened early in eukaryotic evolution and that there are frequently extreme amino acid and nucleotide composition biases in mitochondrial genomes (see above). We used the wMel genome to search for additional possible mitochondrial-derived genes in eukaryotic nuclear genomes. Specifically, we constructed phylogenetic trees for wMel genes that are not in either Rickettsia genome. Five new eukaryotic genes of possible mitochondrial origin were identified: three genes involved in de novo nucleotide biosynthesis (purD, purM, pyrD) and two conserved hypothetical proteins (WD1005, WD0724). The α-Proteobacterial origin of these genes suggests that at least some of the genes of the de novo nucleotide synthesis pathway in eukaryotes might have been laterally acquired from bacteria via the mitochondria. The presence of such genes in other Proteobacteria suggests that their absence from Rickettsia is due to gene loss (Gray et al. 2001). This finding supports the need for additional α-Proteobacterial genomes to identify mitochondrion-derived genes in eukaryotes.

Figure 5. Mitochondrial Evolution Using Concatenated Alignments

Networks of protein LogDet distances for an alignment of 32 proteins constructed with Neighbor-Net (Bryant and Moulton 2003). The scale bar indicates 0.1 substitutions per site. Enlargements at lower right show the component of shared similarity between mitochondrial-encoded proteins and (i) their homologs from intracellular endosymbionts (red) as well as (ii) their homologs from free-living α-Proteobacteria (blue). (A) Result using 6,776 gap-free sites per genome (heavily biased in amino acid composition). (B) Result using 3,100 sites after exclusion of highly variable positions (data not biased in amino acid composition at p = 0.95). All data and alignments are available upon request. Results of phylogenetic analyses are summarized in Table S7. Since amino acid content bias was very severe in these datasets, protein LogDet analyses were also performed. In neighbor-joining, parsimony, and maximum-likelihood trees generated from alignments both including and excluding highly biased positions (6,776 and 3,100 gap-free amino acid sites per genome, respectively), mitochondria usually branched basal to the Wolbachia–Rickettsia clade, but never specifically with Rickettsia (Table S7).

While organelle to nuclear gene transfers are generally accepted, there is a great deal of controversy over whether other gene transfers have occurred from bacteria into animals. In particular, claims of transfer from bacteria into the human genome (Lander et al. 2001) were later shown to be false (Roelofs and Van Haastert 2001; Salzberg et al. 2001; Stanhope et al. 2001). Wolbachia are excellent candidates for such transfer events since they live inside the germ cells, which would allow lateral transfers to the host to be transmitted to subsequent host generations. Consistent with this, a recent study has shown some evidence for the presence of Wolbachia-like genes in a beetle genome (Kondo et al. 2002). The symbiosis between wMel and D. melanogaster provides an ideal case to search for such transfers since we have the complete genomes of both the host and symbiont. Using BLASTN searches and MUMmer alignments, we did not find any examples of highly similar stretches of DNA shared between the two species. In addition, protein-level searches and phylogenetic trees did not identify any specific relationships between wMel and D. melanogaster for any genes. Thus, at least for this host–symbiont association, we do not find any likely cases of recent gene exchange, with genes being maintained in both host and symbiont. In addition, in our phylogenetic analyses, we did not find any examples of wMel proteins branching specifically with proteins from any invertebrate to the exclusion of other eukaryotes. Therefore, at least for the genes in wMel, we do not find evidence for transfer of Wolbachia genes into any invertebrate genome.

Metabolism and Transport

wMel is predicted to have very limited capabilities for membrane transport, for substrate utilization, and for the biosynthesis of metabolic intermediates (Figure S3), similar to what has been seen in other intracellular symbionts and pathogens (Paulsen et al. 2000). Almost all of the identifiable uptake systems for organic nutrients in wMel are for amino acids, including predicted transporters for proline, aspartate/glutamate, and alanine. This pattern of transporters, coupled with the presence of pathways for the metabolism of the amino acids cysteine, glutamate, glutamine, proline, serine, and threonine, suggests that wMel may obtain much of its energy from amino acids. These amino acids could also serve as material for the production of other amino acids. In contrast, carbohydrate metabolism in wMel appears to be limited. The only pathways that appear to be complete are the tricarboxylic acid cycle, the nonoxidative pentose phosphate pathway, and glycolysis, starting with fructose-1,6-bisphosphate. The limited carbohydrate metabolism is consistent with the presence of only one sugar phosphate transporter. wMel can also apparently transport a range of inorganic ions, although two of these systems, for potassium uptake and sodium ion/proton exchange, are frameshifted. In the latter case, two other sodium ion/proton exchangers may be able to compensate for this defect.
Many of the predicted metabolic properties of wMel, such as the focus on amino acid transport and the presence of limited carbohydrate metabolism, are similar to those found in Rickettsia. A major difference with the Rickettsia spp. is the absence of the ADP–ATP exchanger protein in wMel. In Rickettsia this protein is used to import ATP from the host, thus allowing these species to be direct energy scavengers (Andersson et al. 1998). This likely explains the presence of glycolysis in wMel but not Rickettsia. An inability to obtain ATP from its host also helps explain the presence of pathways for the synthesis of the purines AMP, IMP, XMP, and GMP in wMel but not Rickettsia. Other pathways present in wMel but not Rickettsia include threonine degradation (described above), riboflavin biosynthesis, pyrimidine metabolism (i.e., from PRPP to UMP), and chelated iron uptake (using a single ABC transporter). The two Rickettsia species have a relatively large complement of predicted transporters for osmoprotectants, such as proline and glycine betaine, whereas wMel possesses only two of these systems.

Regulatory Responses

The wMel genome is predicted to encode few proteins for regulatory responses. Three genes encoding two-component system subunits are present: two sensor histidine kinases (WD1216 and WD1284) and one response regulator (WD0221). Only six strong candidates for transcription regulators were identified: a homolog of arginine repressors (WD0453), two members of the TenA family of transcription activator proteins (WD0139 and WD0140), a homolog of ctrA, a transcription regulator for two component systems in other α-Proteobacteria (WD0732), and two σ factors (RpoH/WD1064 and RpoD/WD1298). There are also seven members of one paralogous family of proteins that are distantly related to phage repressors (see above), although if they have any role in transcription, it is likely only for phage genes. Such a limited repertoire of regulatory systems has also been reported in other endosymbionts and has been explained by the apparent highly predictable and stable environment in which these species live (Andersson et al. 1998; Read et al. 2000; Shigenobu et al. 2000; Moran and Mira 2001; Akman et al. 2002; Seshadri et al. 2003).

Host–Symbiont Interactions

The mechanisms by which Wolbachia infect host cells and by which they cause the diverse phenotypic effects on host reproduction and fitness are poorly understood, and the wMel genome helps identify potential contributing factors. A complete Type IV secretion system, portions of which have been reported in earlier studies, is present. The complete genome sequence shows that in addition to the five vir genes previously described from Wolbachia wKueYO (Masui et al. 2001), an additional four are present in wMel. Of the nine wMel vir ORFs, eight are arranged into two separate operons. Similar to the single operon identified in wTai and wKueYO, the wMel virB8, virB9, virB10, virB11, and virD4 CDSs are adjacent to wspB, forming a 7 kb operon (WD0004–WD0009). The second operon contains virB3, virB4, and virB6 as well as four additional non-vir CDSs, including three putative membrane-spanning proteins, that form part of a 15.7 kb operon (WD0859–WD0853). Examination of the Rickettsia conorii genome shows a similar organization (Figure 6A). The observed conserved gene order for these genes between these two genomes suggests that the putative membrane-spanning proteins could form a novel and, possibly, integral part of a functioning Type IV secretion system within these bacteria. Moreover, reverse transcription (RT)-PCRs have confirmed that wspB and WD0853–WD0856 are each expressed as part of the two vir operons and further indicate that these additional encoded proteins are novel components of the Wolbachia Type IV secretion system (Figure 6B).
In addition to the two major vir clusters, a paralog of virB8 (WD0817) is also present in the wMel genome. WD0817 is quite divergent from virB8 and, as such, does not appear to have resulted from a recent gene duplication event. RT-PCR experiments have failed to show expression of this CDS in wMel-infected Drosophila (data not shown). PCR primers were designed to all CDSs of the wMel Type IV secretion system and used to successfully amplify orthologs from the divergent Wolbachia strains wRi and wAlbB (data not shown). We were able to detect orthologs to all of the wMel Type IV secretion system components as well as most of the adjacent non-vir CDSs, suggesting that this system is conserved across a range of A- and B-group Wolbachia. An increasing body of evidence has highlighted the importance of Type IV secretion systems for the successful infection, invasion, and persistence of intracellular bacteria within their hosts (Christie 2001; Sexton and Vogel 2002). It is likely that the Type IV system in Wolbachia plays a role in the establishment and maintenance of infection and possibly in the generation of reproductive phenotypes.
Genes involved in pathogenicity in bacteria have been found to be frequently associated with regions of anomalous nucleotide composition, possibly owing to transfer from other species or insertion into the genome from plasmids or phage. In the four such regions in wMel (see above; see Table 3), some additional candidates for pathogenicity-related activities are present including a putative penicillin-binding protein (WD0719), genes predicted to be involved in cell wall synthesis (WD0095–WD0098, including D-alanine-D-alanine ligase, a putative FtsQ, and D-alanyl-D-alanine carboxy peptidase) and a multidrug resistance protein (WD0099). In addition, we have identified a cluster of genes in one of the phage regions that may also have some role in host–symbiont interactions. This cluster (WD0611–WD0621) is embedded within the WO-B phage region of the genome (see Figure 2) and contains many genes that encode proteins with putative roles in the synthesis and degradation of surface polysaccharides, including a UDP-glucose 6-dehydrogenase (WD0620). Since this cluster appears to be normal in terms of phylogeny relative to other genes in the genome (i.e., the genes in this region have normal wMel nucleotide composition and branch in phylogenetic trees with genes from other α-Proteobacteria), it is not likely to have been acquired from other species. However, it is possible that these genes can be transferred among Wolbachia strains via the phage, which in turn could lead to some variation in host–symbiont interactions between Wolbachia strains.

Figure 6. Genomic Organization and Expression of Type IV Secretion Operons in wMel

(A) Organization of the nine vir-like CDSs (white arrows) and five adjacent CDSs that encode for either putative membrane-spanning proteins (black arrows) or non-vir CDSs (gray arrows) of wMel, R. conorii, and A. tumefaciens. Solid horizontal lines denote RT experiments that have confirmed that adjacent CDSs are expressed as part of a polycistronic transcript. Results of these RT-PCR experiments are presented in (B). Lane 1, virB3virB4; lane 2, RT control; lane 3, virB6-WD0856; lane 4, RT control; lane 5, WD0856-WD0855; lane 6, RT control; lane 7, WD0854-WD0853; lane 8, RT control; lane 9, virB8virB9; lane 10, RT control; lane 11, virB9virB11; lane 12, RT control; lane 13, virB11virD4; lane 14, RT control; lane 15, virD4wspB; lane 16, RT control; lane 17, virB4virB6; lane 18, RT control; lane 19, WD0855-WD0854; lane 20, RT control. Only PCRs that contain reverse transcriptase amplified the desired products. PCR primer sequences are listed in Table S9.

Of particular interest for host-interaction functions are the large number of genes that encode proteins that contain ankyrin repeats (Table 4). Ankyrin repeats, a tandem motif of around 33 amino acids, are found mainly in eukaryotic proteins, where they are known to mediate protein–protein interactions (Caturegli et al. 2000). While they have been found in bacteria before, they are usually present in only a few copies per species. wMel has 23 ankyrin repeat-containing genes, the most currently described for a prokaryote, with C. burnetii being next with 13. This is particularly striking given wMel’s relatively small genome size. The functions of the ankyrin repeat-containing proteins in wMel are difficult to predict since most have no sequence similarity outside the ankyrin domains to any proteins of known function. Many lines of evidence suggest that the wMel ankyrin domain proteins are involved in regulating host cell-cycle or cell division or interacting with the host cytoskeleton: (i) many ankyrin-containing proteins in eukaryotes are thought to be involved in linking membrane proteins to the cytoskeleton (Hryniewicz-Jankowska et al. 2002); (ii) an ankyrin-repeat protein of Ehrlichia phagocytophila binds condensed chromatin of host cells and may be involved in host cell-cycle regulation (Caturegli et al. 2000); (iii) some of the proteins that modify the activity of cell-cycle-regulating proteins in D. melanogaster contain ankyrin repeats (Elfring et al. 1997); and (iv) the Wolbachia strain that infects the wasp Nasonia vitripennis induces cytoplasmic incompatibility, likely by interacting with these same cell-cycle proteins (Tram and Sullivan 2002). Of the ankyrin-containing proteins in wMel, those worth exploring in more detail include the several that are predicted to be surface targeted or secreted (Table 4) and thus could be targeted to the host nucleus.
It is also possible that some of the other ankyrin-containing proteins are secreted via the Type IV secretion system in a targeting signal independent pathway. We call particular attention to three of the ankyrin-containing proteins (WD0285, WD0636, and WD0637), which are among the very few genes, other than those encoding components of the translation apparatus, that have significantly biased codon usage relative to what is expected based on GC content, suggesting they may be highly expressed.

Conclusions

Analysis of the wMel genome reveals that it is unique among sequenced genomes of intracellular organisms in that it is both streamlined and massively infected with mobile genetic elements. The persistence of these elements in the genome for apparently long periods of time suggests that wMel is inefficient at getting rid of them, likely a result of experiencing severe population bottlenecks during every cycle of transovarial transmission as well as during sweeps through host populations. Integration of evolutionary reconstructions and genome analysis (phylogenomics) has provided insights into the biology of Wolbachia, helped identify genes that likely play roles in the unusual effects Wolbachia have on their host, and revealed many new details about the evolution of Wolbachia and mitochondria. Perhaps most importantly, future studies of Wolbachia will benefit both from this genome sequence and from the ability to study host–symbiont interactions in a host (D. melanogaster) well-suited for experimental studies.

Materials and Methods

Purification/source of DNA wMel DNA was obtained from D. melanogaster yw67c23 flies that naturally carry the wMel infection. wMel was purified from young adult flies on pulsed-field gels as described previously (Sun et al. 2001). Plugs were digested with the restriction enzyme AscI (GG^CGCGCC), which cuts the bacterial chromosome twice (Sun et al. 2001), aiding in the entry of the DNA into agarose gels. After electrophoresis, the resulting two bands were recovered from the gel and stored in 0.5 M EDTA (pH 8.0). DNA was extracted from the gel slices by first washing in TE (Tris–HCl and EDTA) buffer six times for 30 min each to dilute EDTA followed by two 1-h washes in β-agarase buffer (New England Biolabs, Beverly, Massachusetts, United States). Buffer was then removed and the blocks melted at 70°C for 7 min. The molten agarose was cooled to 40°C and then incubated in β-agarase (1 U/100 μl of molten agarose) for 1 h. The digest was cooled to 4°C for 1 h and then centrifuged at 4,100 × gmax for 30 min at 4°C to remove undigested agarose. The supernatant was concentrated on a Centricon YM-100 microconcentrator (Millipore, Bedford, Massachusetts, United States) after prerinsing with 70% ethanol followed by TE buffer and, after concentration, rinsed with TE. The retentate was incubated with proteinase K at 56°C for 2 h and then stored at 4°C. wMel DNA for gap closure was prepared from approximately 1,000 Drosophila adults using the Holmes–Bonner urea/phenol:chloroform protocol (Holmes and Bonner 1973) to prepare total fly DNA.
Library construction/sequencing/closure The complete genome sequence was determined using the whole-genome shotgun method (Venter et al. 1996). For the random shotgun-sequencing phase, libraries of average size 1.5–2.0 kb and 4.0–8.0 kb were used. After assembly using the TIGR Assembler (Sutton et al. 1995), there were 78 contigs greater than 5,000 bp, 186 contigs greater than 3,000 bp, and 373 contigs greater than 1,500 bp. This number of contigs was unusually high for a 1.27 Mb genome. An initial screen using BLASTN searches against the nonredundant database in GenBank and the Berkeley Drosophila Genome Project site (http://www.fruitfly.org/blast/) showed that 3,912 of the 10,642 contigs were likely contaminants from the Drosophila genome. To aid in closure, the assemblies were rerun with all sequences of likely host origin excluded. Closure, which was made very difficult by the presence of a large amount of repetitive DNA (see below), was done using a mix of primer walking, generation and sequencing of transposon-tagged libraries of large insert clones, and multiplex PCR (Tettelin et al. 1999). The final sequence showed little evidence for polymorphism within the population of Wolbachia DNA. In addition, to obtain sequence across the AscI-cut sites, PCR was performed on undigested DNA. Host contamination did not substantially affect symbiont genome assembly because most of the Drosophila contigs were small, owing to the approximately 100-fold difference in genome sizes between host (approximately 180 Mb) and wMel (1.2 Mb).
Since it has been suggested that Wolbachia and their hosts may undergo lateral gene transfer events (Kondo et al. 2002), genome assemblies were rerun using all of the shotgun and closure reads without excluding any sequences that appeared to be of host origin. Only five assemblies were found to match both the D. melanogaster genome and the wMel assembly. Primers were designed to match these assemblies and PCR was attempted from total DNA of wMel-infected D. melanogaster. In each case, PCR was unsuccessful, and we therefore presume that these assemblies are the result of chimeric cloning artifacts. The complete sequence has been given GenBank accession ID AE017196 and is available at http://www.tigr.org/tdb.
Repeats Repeats were identified using RepeatFinder (Volfovsky et al. 2001), which makes use of the REPuter algorithm (Kurtz and Schleiermacher 1999) to find maximal-length repeats. Some manual curation and BLASTN and BLASTX searches were used to divide repeat families into different classes.
Annotation Identification of putative protein-encoding genes and annotation of the genome was done as described previously (Eisen et al. 2002). An initial set of ORFs likely to encode proteins (CDS) was identified with GLIMMER (Salzberg et al. 1998). Putative proteins encoded by the CDS were examined to identify frameshifts or premature stop codons compared to other species. The sequence traces for each were reexamined and, for some, new sequences were generated. Those for which the frameshift or premature stops were of high quality were annotated as “authentic” mutations. Functional assignment, identification of membrane-spanning domains, determination of paralogous gene families, and identification of regions of unusual nucleotide composition were performed as described previously (Tettelin et al. 2001). Phylogenomic analysis (Eisen 1998a; Eisen and Fraser 2003) was used to aid in functional predictions. Alignments and phylogenetic trees were generated as described (Salzberg et al. 2001).
Comparative genomics All putative wMel proteins were searched using BLASTP against the predicted proteomes of published complete organismal genomes and a set of complete plastid, mitochondrial, plasmid, and viral genomes. The results of these searches were used (i) to analyze the phylogenetic profile (Pellegrini et al. 1999; Eisen and Wu 2002), (ii) to identify putative lineage-specific duplications (those proteins with a top E-value score to another protein from wMel), and (iii) to determine the presence of homologs in different species. Orthologs between the wMel genome and that of the two Rickettsia species were identified by requiring mutual best-hit relationships among all possible pairwise BLASTP comparisons, with some manual correction. Those genes present in both Rickettsia genomes as well as other bacterial species, but not wMel, were considered to have been lost in the wMel branch (see Table S3). Genes present in only one or two of the three species were considered candidates for gene loss or lateral transfer and were also used to identify possible biological differences between these species (see Table S3). For the wMel genes not in the Rickettsia genomes, proteins were searched with BLASTP against the TIGR NRAA database. Protein sequences of their homologs were aligned with CLUSTALW and manually curated. Neighbor-joining trees were constructed using the PHYLIP package.
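The mutual best-hit criterion used above for ortholog identification can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it handles only one pair of genomes, it assumes the BLASTP results have already been reduced to (query, subject, bit score) tuples, and the protein identifiers in the usage note are invented for illustration.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring hit per query.

    `hits` is an iterable of (query, subject, bit_score) tuples; this
    tuple layout is an assumption of the sketch, not a BLAST format.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


# Hypothetical identifiers and scores, for illustration only.
a_vs_b = [("w1", "r1", 250.0), ("w1", "r2", 90.0),
          ("w2", "r2", 120.0), ("w3", "r1", 80.0)]
b_vs_a = [("r1", "w1", 245.0), ("r2", "w2", 118.0), ("r2", "w1", 95.0)]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input, w3 finds r1 as its best hit, but r1 prefers w1, so w3 is left without an ortholog; cases of this kind are what the manual correction step mentioned above would need to examine.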
Phylogenetic analysis of mitochondrial proteins For phylogenetic analysis, the set of all 38 proteins encoded in both the Marchantia polymorpha and Reclinomonas americana (Lang et al. 1997) mitochondrial genomes was collected. Acanthamoeba castellanii was excluded due to high divergence and extremely long evolutionary branches. Six genes were excluded from further analysis because they were too poorly conserved for alignment and phylogenetic analysis (nad7, rps10, sdh3, sdh4, tatC, and yejV), leaving 32 genes for investigation: atp6, atp9, atpA, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad9, rpl16, rpl2, rpl5, rpl6, rps1, rps11, rps12, rps13, rps14, rps19, rps2, rps3, rps4, rps7, rps8, yejR, and yejU. Using FASTA with the mitochondrial proteins as a query, homologs were identified from the genomes of seven α-Proteobacteria: two intracellular symbionts (W. pipientis wMel and Rickettsia prowazekii) and five free-living forms (Sinorhizobium meliloti, Agrobacterium tumefaciens, Brucella melitensis, Mesorhizobium loti, and Rhodopseudomonas sp.). Escherichia coli and Neisseria meningitidis were used as outgroups. Caulobacter crescentus was excluded from analysis because homologs of some of the 32 genes were not found in the current annotation. In the event that more than one homolog was identified per genome, the one with the greatest sequence identity to the mitochondrial query was retrieved. Proteins were aligned using CLUSTALW (Thompson et al. 1994) and concatenated. To reduce the influence of poorly aligned regions, all sites that contained a gap at any position were excluded from analysis, leaving 6,776 positions per genome for analysis. The data contained extreme amino acid bias: all sequences failed the χ2 test at p = 0.95 for deviation from amino acid frequency distribution assumed under either the JTT or mtREV24 models as determined with PUZZLE (Strimmer and von Haeseler 1996).
When the data were iteratively purged of highly variable sites using the method described (Hansmann and Martin 2000), amino acid composition gradually came into better agreement with the amino acid frequency distribution assumed by the model. The longest dataset in which all sequences passed the χ2 test at p = 0.95 consisted of the 3,100 least polymorphic sites. PROTML (Adachi and Hasegawa 1996) analyses of the 3,100-site data using the JTT model detected mitochondria as sisters of the five free-living α-Proteobacteria with low (72%) support, whereas PUZZLE, using the same data, detected mitochondria as sisters of the two intracellular symbionts, also with low (85%) support. This suggested the presence of conflicting signal in the less-biased subset of the data. Therefore, protein log determinants (LogDet) were used to infer distances from the 6,776-site data, since the method can correct for amino acid bias (Lockhart et al. 1994), and Neighbor-Net (Bryant and Moulton 2003) was used to display the resulting matrix, because it can detect and display conflicting signal. The result (see Figure 5A) shows both signals. In no analysis was a sister relationship between Rickettsia and mitochondria detected.
For analyses of individual genes, the 63 proteins encoded in the Reclinomonas mitochondrial genome were compared with FASTA to the proteins from 49 sequenced eubacterial genomes, which included the α-Proteobacteria shown in Figure 5, R. conorii, and Magnetococcus MC1, one of the more divergent α-Proteobacteria. Of those proteins, 50 had sufficiently well-conserved homologs to perform phylogenetic analyses. Homologs were aligned and subjected to phylogenetic analysis with PROTML (Adachi and Hasegawa 1996).
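The gap-removal step applied to the concatenated alignment (excluding every site that contains a gap in any sequence) can be sketched as follows. This is an illustrative reimplementation under simple assumptions, not the code used for the analyses: the alignment is held as a dictionary of equal-length strings, "-" is taken as the only gap character, and the sequences in the usage note are toy data.

```python
def gap_free_columns(alignment, gap_chars="-"):
    """Drop every alignment column in which any sequence has a gap.

    `alignment` maps sequence names to aligned strings of equal length.
    Returns a new dict with the same names and only gap-free columns.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    kept = [col for col in zip(*seqs)
            if not any(residue in gap_chars for residue in col)]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}


# Toy alignment; names and residues are made up for illustration.
toy = {"mito": "MK-LV", "wMel": "MKALV", "Rick": "M-ALV"}
trimmed = gap_free_columns(toy)
```

Here the second and third columns each contain a gap in one sequence, so only three of the five columns survive. Filtering this strictly trades alignment length for reliability, which is consistent with the reduction to 6,776 positions reported above.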
Analysis of wspB sequences To compare wspB sequences from different Wolbachia strains, PCR was done on total DNA extracted from the following sources: wRi was obtained from infected adult D. simulans, Riverside strain; wAlbB was obtained from the infected Aa23 cell line (O’Neill et al. 1997b), and D. immitis Wolbachia was extracted from adult worm tissue. DNA extraction and PCR were done as previously described (Zhou et al. 1998) with wspB-specific primers (wspB-F, 5′-TTTGCAAGTGAAACAGAAGG and wspB-R, 5′-GCTTTGCTGGCAAAATGG). PCR products were cloned into pGem-T vector (Promega, Madison, Wisconsin, United States) as previously described (Zhou et al. 1998) and sequenced (GenBank accession numbers AJ580921–AJ580923). These sequences were compared to previously sequenced wsp genes for the same Wolbachia strains (GenBank accession numbers AF020070, AF020059, and AJ252062). The four partial wsp sequences were aligned using CLUSTALV (Higgins et al. 1992) based on the amino acid translation of each gene and similarly with the wspB sequences. Genetic distances were calculated using the Kimura 2 parameter method and are reported in Table S5.
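For readers unfamiliar with the Kimura 2 parameter method, the distance between two aligned nucleotide sequences is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively. The Python sketch below is a minimal illustration of that formula, not the software used to produce Table S5; skipping sites with gaps or ambiguous bases is an assumption of the sketch, and the sequences in the usage note are invented.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with gaps or ambiguity codes are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Invented 10-base sequences differing by a single A<->G transition.
d = k2p_distance("ACGTACGTAC", "GCGTACGTAC")
```

With one transition in ten sites, P = 0.1 and Q = 0, so the distance reduces to -(1/2) ln(0.8), slightly larger than the uncorrected proportion of differing sites.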
Type IV secretion system To determine whether the vir-like CDSs, as well as adjacent ORFs, were actively expressed within wMel as two polycistronic operons, RT-PCR was used. Total RNA was isolated from infected D. melanogaster yw67c23 adults using Trizol reagent (Invitrogen, Carlsbad, California, United States) and cDNA synthesized using SuperScript III RT (Invitrogen) using primers wspBR, WD0817R, WD0853R, and WD0852R. RNA isolation and RT were done according to manufacturer’s protocols, with the exception that suggested initial incubation of RNA template and primers at 65°C for 5 min and final heat denaturation of RT-enzyme at 70°C for 15 min were not done. PCR was done using rTaq (Takara, Kyoto, Japan), and several primer sets were used to amplify regions spanning adjacent CDSs for most of the two operons. For operon virB3-WD0853, the following primers were used: (virB3virB4)F, (virB3virB4)R, (virB6-WD0856)F, (virB6-WD0856)R, (WD0856-WD0855)F, (WD0856-WD0855)R, (WD0854-WD0853)F, (WD0854-WD0853)R. For operon virB8wspB, the following primers were used: (virB8virB9)F, (virB8virB9)R, (virB9virB11)F, (virB9virB11)R, (virB11virD4)F, (virB11virD4)R, (virD4wspB)F, and (virD4wspB)R. The coexpression of virB4 and virB6, as well as WD0855 and WD0854, was confirmed within the putative virB3-WD0853 operon using nested PCR with the following primers: (virB4virB6)F1, (virB4virB6)R1, (virB4virB6)F2, (virB4virB6)R2, (WD0855-WD0854)F1, (WD0855-WD0854)R1, (WD0855-WD0854)F2, and (WD0855-WD0854)R2. All ORFs within the putative virB8wspB operon were shown to be coexpressed and are thus considered to be a genuine operon. All products were amplified only from RT-positive reactions (see Figure 6). Primer sequences are given in Table S9.

Supporting Information

Figure S1. Phage Trees

Phylogenetic tree showing the relationship between WO-A and WO-B phage from wMel with reported phage from wKue and wTai. The tree was generated from a CLUSTALW multiple sequence alignment (Thompson et al. 1994) using the PROTDIST and NEIGHBOR programs of PHYLIP (Felsenstein 1989).
(60 KB PDF).

Figure S2. Plot of the Effective Number of Codons against GC Content at the Third Codon Position (GC3)

Proteins with fewer than 100 residues are excluded from this analysis because their effective number of codon (ENc) values are unreliable. The curve shows the expected ENc values if codon usage bias is caused by GC variation alone. Colors: yellow, hypothetical; purple, mobile element; blue, others. Most of the variation in codon bias can be traced to variation in GC, indicating that the mutation forces dominate the wMel codon usage. Multivariate analysis of codon usage was performed using the CODONW package (available from http://www.molbiol.ox.ac.uk/cu/codonW.html).
(289 KB PDF).
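A curve of expected ENc values under GC variation alone is commonly drawn from the approximation given by Wright (1990), ENc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3. The sketch below assumes, without confirmation from the text, that the curve in Figure S2 follows this standard form; the function name is hypothetical.

```python
def expected_enc(gc3):
    """Expected effective number of codons if bias reflects GC3 alone.

    Uses the widely cited approximation from Wright (1990); that this
    is the curve plotted in Figure S2 is an assumption of this sketch.
    """
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


# Sample the curve at a few GC3 values for plotting or comparison.
curve = [(s / 10, expected_enc(s / 10)) for s in range(0, 11)]
```

Genes that fall well below this curve at their observed GC3 have stronger codon bias than base composition alone would predict, which is the pattern used to flag candidate highly expressed genes.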

Figure S3. Predicted Metabolism and Transport in wMel

Overview of the predicted metabolism (energy production and organic compounds) and transport in wMel. Transporters are grouped by predicted substrate specificity: inorganic cations (green), inorganic anions (pink), carbohydrates (yellow), and amino acids/peptides/amines/purines and pyrimidines (red). Transporters in the drug-efflux family (labeled as “drugs”) and those of unknown specificity are colored black. Arrows indicate the direction of transport. Energy-coupling mechanisms are also shown: solutes transported by channel proteins (double-headed arrow); secondary transporters (two-arrowed lines, indicating both the solute and the coupling ion); ATP-driven transporters (ATP hydrolysis reaction); unknown energy-coupling mechanism (single arrow). Transporter predictions are based upon a phylogenetic classification of transporter proteins (Paulsen et al. 1998).
(167 KB PDF).

Table S1. Repeats of Greater Than 50 bp in the wMel Genome (with Coordinates)

(649 KB DOC).

Table S2. Inactivated Genes in the wMel Genome

(147 KB DOC).

Table S3. Ortholog Comparison with Rickettsia spp.

(718 KB XLS).

Table S4. Putative Lineage-Specific Gene Duplications in wMel

(116 KB DOC).

Table S5. Genetic Distances as Calculated for Alignments of wsp and wspB Gene Sequences from the Same Wolbachia Strains

(24 KB DOC).

Table S6. Putative DNA Repair and Recombination Genes in the wMel Genome

(26 KB DOC).

Table S7. Phylogenetic Results for Concatenated Data of 32 Mitochondrial Proteins

(34 KB DOC).

Table S8. Individual Phylogenetic Results for Reclinomonas Mitochondrial DNA-Encoded Proteins

(117 KB DOC).

Table S9. PCR Primers

(47 KB DOC).

Accession Numbers

The complete sequence for wMel has been given GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) accession ID number AE017196 and is available through the TIGR Comprehensive Microbial Resource at http://www.tigr.org/tigr-scripts/CMR2/GenomePage3.spl?database=dmg
The GenBank accession numbers for other sequences discussed in this paper are AF020059 (Wolbachia sp. wAlbB outer surface protein precursor wsp gene), AF020070 (Wolbachia sp. wRi outer surface protein precursor wsp gene), AJ252062 (Wolbachia endosymbiont of D. immitis sp. gene for surface protein), AJ580921 (Wolbachia endosymbiont of D. immitis partial wspB gene for Wolbachia surface protein B), AJ580922 (Wolbachia endosymbiont of A. albopictus partial wspB gene for Wolbachia surface protein B), and AJ580923 (Wolbachia endosymbiont of D. simulans partial wspB gene for Wolbachia surface protein B).

Acknowledgments

We acknowledge Barton Slatko, Jeremy Foster, New England Biolabs, and Mark Blaxter for helping inspire this project; Rehka Seshadri for help in examining pathogenicity factors and reading the manuscript; Derek Fouts for examination of group II introns; Susan Lo, Michael Heaney, Vadim Sapiro, and Billy Lee for IT support; Maria-Ines Benito, Naomi Ward, Michael Eisen, Howard Ochman, and Vincent Daubin for helpful discussions; Steven Salzberg and Mihai Pop for help in comparing wMel with the D. melanogaster genome; Elodie Ghedin for access to the B. malayi Wolbachia sequence data; Maria Ermolaeva for assistance with analysis of operons; Dan Haft for designing protein family hidden Markov models for annotation; Owen White for general bioinformatics support; four anonymous reviewers for very helpful comments and suggestions; and Claire M. Fraser for continuing support of TIGR’s scientific research. This project was supported by grant UO1-AI47409–01 to Scott O’Neill and Jonathan A. Eisen from the National Institutes of Allergy and Infectious Diseases.
Conflicts of interest. The authors have declared that no conflicts of interest exist.
Author contributions. M. Wu contributed ideas and analysis in all aspects of the work. L. Sun performed purification of wMel DNA for initial libraries and closure. J. Vamathevan was the closure team leader, performed sequence assembly and analysis, and screened contigs against the Drosophila genome. M. Riegler performed validation of assembly against the physical map and confirmation of rearrangements by long PCR and analysis of repeat regions. R. Deboy was the annotation leader and managed the annotation, ORF management, and frameshifts. J. C. Brownlie performed analysis of Type IV secretion systems. E. A. McGraw performed validation of assembly against physical map and confirmation of rearrangements by long PCR and analysis of wsp paralogs. W. Martin, C. Esser, N. Ahmadinejad, and C. Wiegand performed the mitochondrial evolution analysis. R. Madupu, M. J. Beanan, L. M. Brinkac, S. C. Daugherty, A. S. Durkin, J. F. Kolonay, and W. C. Nelson performed genome annotation. Y. Mohamoud, P. Lee, and K. Berry performed the closure experiments (closed sequencing gaps, multiplex PCR, resolution of small repeats, coverage reactions, contig editing, resolution of large repeats by transposon and primer walking). M. B. Young was the shotgun sequencing leader. T. Utterback and J. Weidman performed shotgun sequencing and frameshift checking; Utterback also worked on the assembly. W. C. Nierman handled the library construction. I. T. Paulsen performed transporter analysis. K. E. Nelson performed metabolism analysis. H. Tettelin analyzed genome properties, repeats, and membrane proteins. S. L. O’Neill and J. A. Eisen supplied ideas, coordination, and analysis; Eisen is the corresponding author.

  1. Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459–468.
  2. Akman L, Yamashita A, Watanabe H, Oshima K, Shiba T, et al. (2002) Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat Genet 32:402–407.
  3. Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, et al. (1998) The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133–140.
  4. Andersson SG, Karlberg O, Canback B, Kurland CG (2003) On the origin of mitochondria: A genomics perspective. Philos Trans R Soc Lond B Biol Sci 358:165–167.
  5. Bazinet C, Rollins JE (2003) Rickettsia-like mitochondrial motility in Drosophila spermiogenesis. Evol Dev 5:379–385.
  6. Bjorkholm B, Sjolund M, Falk PG, Berg OG, Engstrand L, et al. (2001) Mutation frequency and biological cost of antibiotic resistance in Helicobacter pylori. Proc Natl Acad Sci U S A 98:14607–14612.
  7. Boyle L, O’Neill SL, Robertson HM, Karr TL (1993) Interspecific and intraspecific horizontal transfer of Wolbachia in Drosophila. Science 260:1796–1799.
  8. Braig HR, Zhou W, Dobson SL, O’Neill SL (1998) Cloning and characterization of a gene encoding the major surface protein of the bacterial endosymbiont Wolbachia pipientis. J Bacteriol 180:2373–2378.
  9. Bryant D, Moulton V (2003) Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Mol Biol Evol 20 Dec 5 [Epub ahead of print].
  10. Caturegli P, Asanovich KM, Walls JJ, Bakken JS, Madigan JE, et al. (2000) ankA: An Ehrlichia phagocytophila group gene encoding a cytoplasmic protein antigen with ankyrin repeats. Infect Immun 68:5277–5283.
  11. Christie PJ (2001) Type IV secretion: Intercellular transfer of macromolecules by systems ancestrally related to conjugation machines. Mol Microbiol 40:294–305.
  12. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, et al. (1999) Alignment of whole genomes. Nucleic Acids Res 27:2369–2376.
  13. Dumler SJ, Barbet AF, Bekker CPJ, Dasch GA, Palmer GH, et al. (2001) Reorganization of genera in the families Rickettsiaceae and Anaplasmataceae in the order Rickettsiales: Unification of some species of Ehrlichia with Anaplasma, Cowdria with Ehrlichia and Ehrlichia with Neorickettsia—Descriptions of six new species combinations and designation of Ehrlichia equi and “HGE agent” as subjective synonyms of Ehrlichia phagocytophila. Int J Syst Evol Microbiol 51:2145–2165.
  14. Eiglmeier K, Parkhill J, Honore N, Garnier T, Tekaia F, et al. (2001) The decaying genome of Mycobacterium leprae. Lepr Rev 72:387–398.
  15. Eisen JA (1997) Gastrogenomic delights: A movable feast. Nat Med 3:1076–1078.
  16. Eisen JA (1998a) A phylogenomic study of the MutS family of proteins. Nucleic Acids Res 26:4291–4300.
  17. Eisen JA (1998b) Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163–167.
  18. Eisen JA, Fraser CM (2003) Phylogenomics: Intersection of evolution and genomics. Science 300:1706–1707.
  19. Eisen JA, Hanawalt PC (1999) A phylogenomic study of DNA repair genes, proteins, and processes. Mutat Res 435:171–213.
  20. Eisen JA, Wu M (2002) Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theor Popul Biol 61:481–487.
  21. Eisen JA, Heidelberg JF, White O, Salzberg SL (2000) Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol 1:1–9 RESEARCH0011.
  22. Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, et al. (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99:9509–9514.
  23. Elfring LK, Axton JM, Fenger DD, Page AW, Carminati JL, et al. (1997) Drosophila PLUTONIUM protein is a specialized cell cycle regulator required at the onset of embryogenesis. Mol Biol Cell 8:583–593.
  24. Emelyanov VV (2001a) Evolutionary relationship of Rickettsiae and mitochondria. FEBS Lett 501:11–18.
  25. Emelyanov VV (2001b) Rickettsiaceae, Rickettsia-like endosymbionts, and the origin of mitochondria. Biosci Rep 21:1–17.
  26. Emelyanov VV (2003a) Mitochondrial connection to the origin of the eukaryotic cell. Eur J Biochem 270:1599–1618.
  27. Emelyanov VV (2003b) Phylogenetic affinity of a Giardia lamblia cysteine desulfurase conforms to canonical pattern of mitochondrial ancestry. FEMS Microbiol Lett 226:257–266.
  28. Felsenstein J (1989) PHYLIP—Phylogeny inference package (version 3.2). Cladistics 5:164–166.
  29. Frank AC, Amiri H, Andersson SG (2002) Genome deterioration: Loss of repeated sequences and accumulation of junk DNA. Genetica 115:1–12.
  30. Gray MW, Burger G, Lang BF (2001) The origin and early evolution of mitochondria. Genome Biol 2:REVIEWS1018.
  31. Gupta RS (1995) Evolution of the chaperonin families (Hsp60, Hsp10 and Tcp-1) of proteins and the origin of eukaryotic cells. Mol Microbiol 15:1–11.
  32. Hansmann S, Martin W (2000) Phylogeny of 33 ribosomal and six other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: Influence of excluding poorly alignable sites from analysis. Int J Syst Evol Microbiol 50:1655–1663.
  33. Higgins D, Bleasby A, Fuchs R (1992) ClustalV: Improved software for multiple sequence alignment. Comput Appl Biosci 8:189–191.
  34. Holmes DS, Bonner J (1973) Preparation, molecular weight, base composition, and secondary structure of giant nuclear ribonucleic acid. Biochemistry 12:2330–2338.
  35. Hryniewicz-Jankowska A, Czogalla A, Bok E, Sikorski AF (2002) Ankyrins, multifunctional proteins involved in many cellular pathways. Folia Histochem Cytobiol 40:239–249.
  36. Itoh T, Martin W, Nei M (2002) Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc Natl Acad Sci U S A 99:12944–12948.
  37. Jamnongluk W, Kittayapong P, Baimai V, O’Neill SL (2002) Wolbachia infections of tephritid fruit flies: Molecular evidence for five distinct strains in a single host species. Curr Microbiol 45:255–260.
  38. Jeyaprakash A, Hoy MA (2000) Long PCR improves Wolbachia DNA amplification: wsp sequences found in 76% of sixty-three arthropod species. Insect Mol Biol 9:393–405.
  39. Karlin S, Brocchieri L (2000) Heat shock protein 60 sequence comparisons: Duplications, lateral transfer, and mitochondrial evolution. Proc Natl Acad Sci U S A 97:11348–11353.
  40. Kondo N, Nikoh N, Ijichi N, Shimada M, Fukatsu T (2002) Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc Natl Acad Sci U S A 99:14280–14285.
  41. Kurtz S, Schleiermacher C (1999) REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics 15:426–427.
  42. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
  43. Lang BF, Burger G, O’Kelly CJ, Cedergren R, Golding GB, et al. (1997) An ancestral mitochondrial DNA resembling a eubacterial genome in miniature. Nature 387:493–497.
  44. Lang BF, Seif E, Gray MW, O’Kelly CJ, Burger G (1999) A comparative genomics approach to the evolution of eukaryotes and their mitochondria. J Eukaryot Microbiol 46:320–326.
  45. Lawrence JG (2001) Catalyzing bacterial speciation: Correlating lateral transfer with genetic headroom. Syst Biol 50:479–496.
  46. Lawrence JG, Ochman H (1997) Amelioration of bacterial genomes: Rates of change and exchange. J Mol Evol 44:383–397.
  47. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95:9413–9417.
  48. Lin M, Rikihisa Y (2003) Ehrlichia chaffeensis and Anaplasma phagocytophilum lack genes for lipid A biosynthesis and incorporate cholesterol for their survival. Infect Immun 71:5324–5331.
  49. Lo N, Casiraghi M, Salati E, Bazzocchi C, Bandi C (2002) How many Wolbachia supergroups exist? Mol Biol Evol 19:341–346.
  50. Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering evolutionary trees under a more realistic evolutionary model. Mol Biol Evol 11:605–612.
  51. Martin W (1999) Mosaic bacterial chromosomes: A challenge en route to a tree of genomes. Bioessays 21:99–104.
  52. Masui S, Sasaki T, Ishikawa H (2000) Genes for the type IV secretion system in an intracellular symbiont, Wolbachia, a causative agent of various sexual alterations in arthropods. J Bacteriol 182:6529–6531.
  53. Masui S, Kuroiwa H, Sasaki T, Inui M, Kuroiwa T, et al. (2001) Bacteriophage WO and virus-like particles in Wolbachia, an endosymbiont of arthropods. Biochem Biophys Res Commun 283:1099–1104.
  54. McGraw EA, Merritt DJ, Droller JN, O’Neill SL (2001) Wolbachia-mediated sperm modification is dependent on the host genotype in Drosophila. Proc R Soc Lond B Biol Sci 268:2565–2570.
  55. Mira A, Ochman H, Moran NA (2001) Deletional bias and the evolution of bacterial genomes. Trends Genet 17:589–596.
  56. Moran NA (1996) Accelerated evolution and Muller’s rachet in endosymbiotic bacteria. Proc Natl Acad Sci U S A 93:2873–2878.
  57. Moran NA, Mira A (2001) The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol 2:RESEARCH0054.
  58. Muller M, Martin W (1999) The genome of Rickettsia prowazekii and some thoughts on the origin of mitochondria and hydrogenosomes. Bioessays 21:377–381.
  59. O’Neill SL, Hoffmann AA, Werren JH, editors (1997a) Influential passengers: Inherited microorganisms and arthropod reproduction. Oxford: Oxford University Press. 228 p.
  60. O’Neill SL, Pettigrew MM, Sinkins SP, Braig HR, Andreadis TG, et al. (1997b) In vitro cultivation of Wolbachia pipientis in an Aedes albopictus cell line. Insect Mol Biol 6:33–39.
  61. Ogata H, Audic S, Renesto-Audiffren P, Fournier PE, Barbe V, et al. (2001) Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science 293:2093–2098.
  62. Parkhill J, Wren BW, Thomson NR, Titball RW, Holden MT, et al. (2001) Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413:523–527.
  63. Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, et al. (2003) Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nat Genet 35:32–40.
  64. Paulsen IT, Sliwinski MK, Saier MH Jr (1998) Microbial genome analyses: Global comparisons of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J Mol Biol 277:573–592.
  65. Paulsen IT, Nguyen L, Sliwinski MK, Rabus R, Saier MH Jr (2000) Microbial genome analyses: Comparative transport capabilities in eighteen prokaryotes. J Mol Biol 301:75–100.
  66. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci U S A 96:4285–4288.
  67. Penny D, McComish BJ, Charleston MA, Hendy MD (2001) Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J Mol Evol 53:711–723.
  68. Read TD, Brunham RC, Shen C, Gill SR, Heidelberg JF, et al. (2000) Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res 28:1397–1406.
  69. Roelofs J, Van Haastert PJ (2001) Genes lost during evolution. Nature 411:1013–1014.
  70. Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26:544–548.
  71. Salzberg SL, White O, Peterson J, Eisen JA (2001) Microbial genes in the human genome: Lateral transfer or gene loss? Science 292:1903–1906.
  72. Selby CP, Witkin EM, Sancar A (1991) Escherichia coli mfd mutant deficient in “mutation frequency decline” lacks strand-specific repair: In vitro complementation with purified coupling factor. Proc Natl Acad Sci U S A 88:11574–11578.
  73. Seshadri R, Paulsen IT, Eisen JA, Read TD, Nelson KE, et al. (2003) Complete genome sequence of the Q-fever pathogen Coxiella burnetii. Proc Natl Acad Sci U S A 100:5455–5460.
  74. Sexton JA, Vogel JP (2002) Type IVB secretion by intracellular pathogens. Traffic 3:178–185.
  75. Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H (2000) Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81–86.
  76. Sicheritz-Ponten T, Kurland CG, Andersson SG (1998) A phylogenetic analysis of the cytochrome b and cytochrome c oxidase I genes supports an origin of mitochondria from within the Rickettsiaceae. Biochim Biophys Acta 1365:545–551.
  77. Sinkins SP, O’Neill SL (2000) Wolbachia as a vehicle to modify insect populations. In: James AA, editor. Insect transgenesis: Methods and applications. Boca Raton, Florida: CRC Press. 271–288.
  78. Stanhope MJ, Lupas A, Italia MJ, Koretke KK, Volker C, et al. (2001) Phylogenetic analyses do not support horizontal gene transfers from bacteria to vertebrates. Nature 411:940–944.
  79. Strimmer K, von Haeseler A (1996) Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13:964–969.
  80. Sun LV, Foster JM, Tzertzinis G, Ono M, Bandi C, et al. (2001) Determination of Wolbachia genome size by pulsed-field gel electrophoresis. J Bacteriol 183:2219–2225.
  81. Sun LV, Riegler M, O’Neill SL (2003) Development of a physical and genetic map of the virulent Wolbachia strain wMelPop. J Bacteriol 185:7077–7084.
  82. Sutton G, White O, Adams M, Kerlavage A (1995) TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci Tech 1:9–19.
  83. Tamas I, Klasson L, Canback B, Naslund AK, Eriksson AS, et al. (2002) 50 million years of genomic stasis in endosymbiotic bacteria. Science 296:2376–2379.
  84. Taylor MJ (2002) A new insight into the pathogenesis of filarial disease. Curr Mol Med 2:299–302.
  85. Taylor MJ, Hoerauf A (2001) A new approach to the treatment of filariasis. Curr Opin Infect Dis 14:727–731.
  86. Taylor MJ, Bandi C, Hoerauf AM, Lazdins J (2000) Wolbachia bacteria of filarial nematodes: A target for control? Parasitol Today 16:179–180.
  87. Tettelin H, Radune D, Kasif S, Khouri H, Salzberg SL (1999) Optimized multiplex PCR: Efficiently closing a whole-genome shotgun sequencing project. Genomics 62:500–507.
  88. Tettelin H, Nelson KE, Paulsen IT, Eisen JA, Read TD, et al. (2001) Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293:498–506.
  89. Thompson JD, Higgins DG, Gibson TJ (1994) ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680.
  90. Tram U, Sullivan W (2002) Role of delayed nuclear envelope breakdown and mitosis in Wolbachia-induced cytoplasmic incompatibility. Science 296:1124–1126.
  91. van Ham RC, Kamerbeek J, Palacios C, Rausell C, Abascal F, et al. (2003) Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci U S A 100:581–586.
  92. Venter JC, Smith HO, Hood L (1996) A new strategy for genome sequencing. Nature 381:364–366.
  93. Viale AM, Arakaki AK (1994) The chaperone connection to the origins of the eukaryotic organelles. FEBS Lett 341:146–151.
  94. Volfovsky N, Haas BJ, Salzberg SL (2001) A clustering method for repeat analysis in DNA sequences. Genome Biol 2:RESEARCH0027.
  95. Ware J, Moran L, Foster J, Posfai J, Vincze T, et al. (2002) Sequencing and analysis of a 63 kb bacterial artificial chromosome insert from the Wolbachia endosymbiont of the human filarial parasite Brugia malayi. Int J Parasitol 32:159–166.
  96. Wernegreen J, Moran NA (1999) Evidence for genetic drift in endosymbionts (Buchnera): Analyses of protein-coding genes. Mol Biol Evol 16:83–97.
  97. Werren JH (1998) Wolbachia and speciation. In: Berlocher SH, editor. Endless forms: Species and speciation. New York: Oxford University Press. 245–260.
  98. Werren JH, O’Neill SL (1997) The evolution of heritable symbionts. In: O’Neill SL, Hoffmann AA, Werren JH, editors. Influential passengers: Inherited microorganisms and arthropod reproduction. Oxford: Oxford University Press. 1–41.
  99. Werren JH, Windsor DM (2000) Wolbachia infection frequencies in insects: Evidence of a global equilibrium? Proc R Soc Lond B Biol Sci 267:1277–1285.
  100. Witkin EM (1994) Mutation frequency decline revisited. Bioessays 16:437–444.
  101. Zhou W, Rousset F, O’Neill SL (1998) Phylogeny and PCR-based classification of Wolbachia strains using wsp gene sequences. Proc R Soc Lond B Biol Sci 265:509–515.

Global Ocean Survey to be on PBS Newshour with Jim Lehrer

Apparently they are running a story on the Venter Global Ocean Survey project on the NewsHour tonight.

I am not sure exactly what they will say, but it is good that the project has made it onto my favorite news show.

Evidence for symmetric chromosomal inversions around the replication origin in bacteria

I am posting here my first Open Access article, from Genome Biology in 2000.

Research

Evidence for symmetric chromosomal inversions around the replication origin in bacteria
Jonathan A Eisen, John F Heidelberg, Owen White and Steven L Salzberg

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA

Genome Biology 2000, 1:research0011.1-0011.9 doi:10.1186/gb-2000-1-6-research0011

Subject areas: Genome studies, Microbiology and parasitology, Evolution

The electronic version of this article is the complete one and can be found online at: http://genomebiology.com/2000/1/6/research/0011

Received: 7 August 2000

Revisions received: 25 September 2000

Accepted: 19 October 2000

Published: 4 December 2000

© 2000 GenomeBiology.com

Outline

Abstract

Abstract

Background

Results and discussion

Conclusions

Materials and methods

Acknowledgements

References

Background

Whole-genome comparisons can provide great insight into many aspects of biology. Until recently, however, comparisons were mainly possible only between distantly related species. Complete genome sequences are now becoming available from multiple sets of closely related strains or species.

Results

By comparing the recently completed genome sequences of Vibrio cholerae, Streptococcus pneumoniae and Mycobacterium tuberculosis to those of closely related species – Escherichia coli, Streptococcus pyogenes and Mycobacterium leprae, respectively – we have identified an unusual and previously unobserved feature of bacterial genome structure. Scatterplots of the conserved sequences (both DNA and protein) between each pair of species produce a distinct X-shaped pattern, which we call an X-alignment. The key feature of these alignments is that they have symmetry around the replication origin and terminus; that is, the distance of a particular conserved feature (DNA or protein) from the replication origin (or terminus) is conserved between closely related pairs of species. Statistically significant X-alignments are also found within some genomes, indicating that there is symmetry about the replication origin for paralogous features as well.

Conclusions

The most likely mechanism of generation of X-alignments involves large chromosomal inversions that reverse the genomic sequence symmetrically around the origin of replication. The finding of these X-alignments between many pairs of species suggests that chromosomal inversions around the origin are a common feature of bacterial genome evolution.

Background

Large-scale genomic rearrangements and duplications are important in the evolution of species. Previously, these large-scale genome-changing events were studied through genetic or cytological studies. With the availability of many complete genome sequences it is now possible to study such events through comparative genomics. The publication of the yeast genome has led to much better insight into the duplication events that have occurred in fungal and eukaryotic evolution (for example, see [1]). Large chromosomal duplications have also been found from analysis of completed chromosomes of Arabidopsis thaliana [2,3]. The ability to detect large-scale genomic changes is dependent in large part on which genomes are available. Such studies in bacteria, for example, have been limited by the availability of genomes only from distantly related sets of species. Recently, however, the genomes of sets of closely related bacterial species have become available. We have compared these closely related bacterial genomes and have discovered an unusual phenomenon – alignments of whole genomes that show an X-shaped pattern (which we refer to as X-alignments). Here we present the evidence for these X-alignments and discuss mechanisms that might have produced them.

Results and discussion

Figures

Figure 1. Between-species whole-genome DNA alignments

Figure 2. Whole-genome proteome alignments

Figure 3. Within-genome DNA alignments

Figure 4. Schematic model of genome inversions

Tables

Table 1. Whole-genome DNA alignments using MUMmer

Table 2. Whole-genome protein-level comparisons

Whole-genome X-alignments between species at the DNA level

We compared the DNA sequences of the two chromosomes of Vibrio cholerae [4] with the sequence of the Escherichia coli chromosome [5] using a suffix tree alignment algorithm [6]. The analysis revealed a significant alignment at the DNA level between the V. cholerae large chromosome (chrI) [4] and the E. coli chromosome [5] spanning the entire length of these chromosomes (Figure 1a). Analysis of the reverse complement of V. cholerae chrI with E. coli also produced a significant alignment (Figure 1b). When superimposed, the two alignments produce a clear ‘X’ shape (Figure 1c) that is symmetric about the origin of replication of both genomes. This symmetry indicates that matching sequences tend to occur at the same distance from the origin but not necessarily on the same side of the origin. The X-alignment between V. cholerae and E. coli was found to be statistically significant using a test based on the number of matches found in diagonal strips in the alignment (see the Materials and methods section). Specifically, when V. cholerae chrI is aligned in the forward direction against E. coli, there are 459 maximal unique matching subsequences (MUMs; see the Materials and methods section), of which 177 occurred in a diagonal strip covering 10% of the total area (compared to the expected value of 46). The probability of observing this high a number of MUMs by chance is 4.7 × 10⁻⁵⁹. The alignment of V. cholerae chrI in the reverse direction against E. coli (which corresponds to the MUMs on the anti-diagonal) has a probability of 1.8 × 10⁻⁹⁰. As a control, we compared the genomes of distantly related species, such as E. coli and Mycobacterium tuberculosis. These do not show a significant X-alignment (Table 1).
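As a rough illustration of the diagonal-strip test, the sketch below treats each MUM as landing independently in the strip with probability equal to the strip's share of the plot area (0.10) and computes the binomial upper tail. This binomial model is an assumption made here for illustration and is not necessarily the exact test described in the Materials and methods section; the function name is hypothetical. Only the counts (459 MUMs, 177 in the strip) come from the text.

```python
from fractions import Fraction
from math import comb


def strip_tail_probability(total_mums, mums_in_strip, strip_fraction):
    """Binomial P(at least `mums_in_strip` of `total_mums` fall in the strip).

    Assumes (for illustration only) that MUMs fall independently and
    uniformly over the plot, so each lands in the strip with
    probability `strip_fraction`.
    """
    p = Fraction(strip_fraction)  # exact arithmetic avoids float underflow
    q = 1 - p
    tail = sum(
        comb(total_mums, k) * p**k * q**(total_mums - k)
        for k in range(mums_in_strip, total_mums + 1)
    )
    return float(tail)


# Counts quoted in the text for V. cholerae chrI (forward) versus E. coli.
total_mums = 459
mums_in_strip = 177
strip_fraction = Fraction(1, 10)

expected_in_strip = float(total_mums * strip_fraction)  # about 46, as in the text
p_value = strip_tail_probability(total_mums, mums_in_strip, strip_fraction)
print(f"expected MUMs in strip: {expected_in_strip:.1f}")
print(f"binomial tail probability: {p_value:.2e}")
```

Under these assumptions the tail probability is vanishingly small, on the same order as the value quoted above, which is why an excess of this size along a diagonal is very unlikely to arise by chance.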

We have found that X-alignments of whole genomes are not limited to the V. cholerae versus E. coli comparison. For example, a whole-genome comparison of two bacteria in the genus Streptococcus – S. pyogenes [7] and S. pneumoniae (H. Tettelin, personal communication) – reveals a global X-alignment similar to that of V. cholerae versus E. coli (Figure 1d) which is also statistically significant (Table 1). In addition, an X-alignment is found between two species in the genus Mycobacterium – M. tuberculosis [8] and M. leprae [9] (Figure 1e) – as well as between two strains of Helicobacter pylori (data not shown). The X-alignments observed between any two pairs of genomes are not identical in every aspect. For example, in the alignment between the two Mycobacterium species, each conserved region is much longer than in the other genome pairs. We believe this is due to different numbers of evolutionary events between the species (see below). Whole-genome X-alignments were not found between any other pairs of species, although a related pattern was seen between some of the chlamydial species (see below).

Whole-genome X-alignments between species are also found at the proteome level

To test whether the X-alignments found in the DNA analysis could also be found at the level of whole proteomes, we conducted comparisons of homologous proteins between species (see the Materials and methods section). Figure 2a shows a scatterplot of chromosome positions of all proteins homologous between V. cholerae chrI and E. coli. The presence of many large gene families causes a great deal of noise in this comparison. This noise can be reduced by considering only the best matching homolog for each open reading frame (ORF), rather than all protein homologs (Figure 2b). This filtered protein comparison results in an X-alignment that is statistically significant (Table 2).
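The best-match filter described above can be sketched as follows. The tuple layout, ORF names and scores are hypothetical and chosen only for illustration; the actual similarity search and scoring are those given in the Materials and methods section.

```python
def best_hit_per_orf(hits):
    """Keep only the highest-scoring homolog for each query ORF.

    `hits` is an iterable of tuples
    (query_orf, query_position, subject_orf, subject_position, score).
    """
    best = {}
    for hit in hits:
        query, score = hit[0], hit[4]
        if query not in best or score > best[query][4]:
            best[query] = hit
    return best


# Toy data: VC_A belongs to a gene family and has two homologs;
# only the stronger match survives the filter.
hits = [
    ("VC_A", 1000, "EC_1", 1200, 250.0),
    ("VC_A", 1000, "EC_7", 90000, 80.0),
    ("VC_B", 5000, "EC_3", 4800, 310.0),
]
filtered = best_hit_per_orf(hits)

# One scatterplot point per query ORF: (position in genome 1, position in genome 2).
points = sorted((hit[1], hit[3]) for hit in filtered.values())
print(points)
```

Dropping the weaker family members removes most of the off-diagonal scatter, which is the noise reduction the filtered comparison relies on.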

Whole-genome X-alignments within species

The finding of the X-alignment pattern between species led us to search for similar patterns within species; that is, global alignments of a genome with its own reverse complement. Of the genomes for which we found between-species X-alignments (M. tuberculosis, M. leprae, S. pyogenes, S. pneumoniae, E. coli and V. cholerae), statistically significant self-alignments are detected for all except M. tuberculosis (Figure 3; probabilities shown in Table 1). Interestingly, these self-alignments are not as strong as those between species. Proteome analysis also shows an X-alignment within species (shown for V. cholerae chrI in Figure 2d; probabilities shown in Table 2). The X-alignment of proteins within V. cholerae chrI is statistically significant only for recently duplicated genes and disappears when all paralogs are included. The importance of filtering for recent duplications is discussed below.
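For readers unfamiliar with self-alignment against the reverse complement, the sketch below shows the two operations involved: building the reverse complement, and mapping a match found on it back to forward-strand coordinates. The helper names are hypothetical, and the toy sequence stands in for a genome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_start(rc_start, match_len, genome_len):
    """Map a match starting at `rc_start` on the reverse complement
    to its 0-based start on the forward strand."""
    return genome_len - rc_start - match_len


genome = "AACGT"
rc = reverse_complement(genome)

# The first three bases of the reverse complement correspond to the
# last three bases of the forward strand, read on the opposite strand.
start = forward_start(0, 3, len(genome))
print(rc, start, genome[start:start + 3])
```

A match between position x on the forward strand and the corresponding position on the reverse complement is what appears on the anti-diagonal of the self-alignment plots.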

Model I: whole-genome inverted duplications

One possible explanation for an X-alignment within and between species is an ancestral inverted duplication of the whole genome, as has been suggested for E. coli [10]. The weak or missing X-alignment within species could be explained by gene loss of one of the two duplicates of many of the pairs of genes in the different lineages. Gene loss has been found to follow large chromosomal or genome duplications [11,12,13]. This gene loss is thought to stabilize large duplications by preventing recombination events between duplicate genes. If gene loss is responsible for the weak X-alignment within species, then to maintain the X-alignments between species, the member of the gene pair lost in a particular lineage should be essentially random. If an ancient inverted duplication followed by differential gene loss is the correct explanation for the observed X-alignments, one would expect the genes along one diagonal to be orthologous between species (related to each other by the speciation event), while the genes along the other diagonal should be paralogous (related to each other by the genome duplication event before the speciation of the two lineages). However, the evidence appears to contradict this model: likely orthologous gene pairs are equally distributed on each diagonal (data not shown).

Model II: chromosomal inversions about the origin and/or terminus

A second possible explanation for the X-alignments is that an underlying mechanism allows sections of DNA to move within the genome but maintains the distance of these sections from the origin and/or terminus. There are a variety of possible mechanisms for such movement, but we believe the most likely explanation is the occurrence of large chromosomal inversions that pivot around the replication origin and/or terminus. Large chromosomal inversions, including those that occur around the replication origin and terminus, have been shown to occur in E. coli and Salmonella typhimurium in the laboratory (see, for example, [14,15,16,17,18]). The occurrence of such inversions over evolutionary time scales was first suggested by comparative analysis of the complete genomes of four strains in the genus Chlamydia [19]. In that study, we found that the major chromosomal differences between C. pneumoniae and C. trachomatis (shown in Figure 2c) were consistent with the occurrence of large inversions that pivoted around the origin and terminus (including multiple inversions of different sizes). In Figure 4 we present a hypothetical model showing how a small number of inversions centered around the origin or terminus could produce patterns very similar to those seen in the Chlamydia, Mycobacterium and Helicobacter comparisons. The continued occurrence of such inversions over longer time scales would result in an X-alignment similar to that seen in the V. cholerae versus E. coli and S. pneumoniae versus S. pyogenes comparisons. Thus the different between-species X-alignments could be the result of different numbers of inversions between particular pairs of species.
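The inversion model can be explored with a toy simulation. The sketch below is only an illustration under simplifying assumptions: the chromosome is circular with the origin at coordinate 0, genes are points, and every inversion is perfectly symmetric about the origin. The function names and parameters are hypothetical and do not come from the original study.

```python
import random


def invert_about_origin(positions, genome_length, half_width):
    """Apply one inversion that is symmetric about the origin (coordinate 0).

    Every gene within half_width of the origin is reflected to the
    opposite side, so its distance from the origin is unchanged.
    """
    result = []
    for x in positions:
        distance = min(x, genome_length - x)  # circular distance to origin
        if distance <= half_width:
            result.append((genome_length - x) % genome_length)
        else:
            result.append(x)
    return result


def simulate(genome_length, n_genes, n_inversions, seed=0):
    """Return (ancestral, derived) gene positions after random inversions."""
    rng = random.Random(seed)
    ancestral = sorted(rng.sample(range(genome_length), n_genes))
    derived = list(ancestral)
    for _ in range(n_inversions):
        half_width = rng.randint(1, genome_length // 2)
        derived = invert_about_origin(derived, genome_length, half_width)
    return ancestral, derived
```

Under these assumptions each gene ends up either at its ancestral coordinate x or at L - x, so a scatterplot of ancestral against derived positions falls on the diagonal and the anti-diagonal. A few inversions leave long unbroken segments, while many inversions break the two diagonals into short pieces. Inversions that are only approximately symmetric would blur the pattern rather than remove it.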

Inversions about the origin and terminus could also produce an X-alignment within species, through the splitting of tandemly duplicated sequence. Many sets of tandemly duplicated genes are found in most bacterial genomes [19,20] (also see Figure 3a,c). As tandem duplications are inherently unstable (one of the duplicates can be rapidly eliminated by slippage and/or recombination events [21]), the fact that many tandem pairs are present within each genome suggests that tandem duplications occur frequently. Thus, it is reasonable to assume that occasionally a large inversion will split a pair of tandemly duplicated genes. An inversion that pivots about the origin and also splits a tandem duplication will result in a pair of paralogous genes spaced symmetrically on opposite sides of the origin.

If our inversion model is correct, then the genes along both diagonals in the between-species alignments should be orthologous, which is the case (see above). In contrast, genes along the anti-diagonal in the within-species X-alignments should be recent tandem duplicates that have been separated by inversions. This also appears to be the case – in the within-species analysis of V. cholerae chrI ORFs, the X-alignment shows up best when only recent duplicates are analyzed (Figure 2d). The splitting of tandem duplicates by inversions may be a general mechanism to stabilize the coexistence of duplicated genes, as it will prevent their elimination by unequal crossing-over or replication slippage events.

What could cause inversions that pivot around the origin and terminus of the genome to occur more frequently than other inversions? One possibility is that many inversions occur, but there is selection against those that change the distance of a gene from the origin or terminus. Such a possibility has been suggested by experimental work in E. coli [14,15]. Additional studies have, however, suggested that there is little selective difference between inversions and that instead there may be certain regions that are more prone to inversion than others [16,17,18,22,23]. Alternatively, the inversion events could be linked to replication, as has been suggested for small local inversion events [24]. Whatever the mechanisms, the fact that we find evidence for such inversions between many pairs of species suggests that they are a common feature of bacterial evolution.

Many aspects of the X-alignments require further exploration. For example, to split a tandem duplication, an inversion must fall precisely on the boundary between two duplicated genes. This would appear to be unlikely, requiring a large number of inversions in order to generate a sufficient number of split gene pairs. If the mechanisms of gene duplication are somehow related to the mechanisms of inversion, however, then this model is more plausible. The process of duplicating a gene, if it occurs during replication, might promote a recombination event within the bacterial chromosome that inverts the sequence from the origin up to that point. As with inversion events, recombination and replication have been found to be tightly coupled [25].

Conclusions

We present here a novel observation regarding the conservation between bacterial species of the distance of particular genes from the replication origin or terminus. The initial observation was only possible due to the availability of complete genome sequences from pairs of moderately closely related species (for example, V. cholerae and E. coli). This shows the importance of having genome pairs from many levels of evolutionary relatedness. Comparisons of distantly related species enable the determination of universal features of life as well as of events that occur very rarely. Comparison of very closely related species allows the identification of frequent events such as transitional changes at third codon positions or tandem duplications. To elucidate all other events in the history of life, genome pairs covering all the intermediate levels of evolutionary relatedness will be needed.

Materials and methods

Genomes analyzed

Complete published genome sequences were obtained from the National Center for Biotechnology Information website [26] or from the TIGR Comprehensive Microbial Resource [27]. These included Aeropyrum pernix [28], Aquifex aeolicus [29], Archaeoglobus fulgidus [30], Bacillus subtilis [31], Borrelia burgdorferi [32], Campylobacter jejuni [33], Chlamydia pneumoniae AR39 [19], Chlamydia pneumoniae CWL029 [34], Chlamydia trachomatis (D/UW-3/Cx) [35], Chlamydia trachomatis MoPn [19], Deinococcus radiodurans [36], Escherichia coli [5], Haemophilus influenzae [37], Helicobacter pylori [38], Helicobacter pylori J99 [39], Methanobacterium thermoautotrophicum [40], Methanococcus jannaschii [41], Mycobacterium tuberculosis [8], Mycoplasma genitalium [42], Mycoplasma pneumoniae [43], Neisseria meningitidis MC58 [20], Neisseria meningitidis serogroup A strain Z2491 [44], Pyrococcus horikoshii [45], Rickettsia prowazekii [46], Synechocystis sp. [47], Thermotoga maritima [48], Treponema pallidum [49], and Vibrio cholerae [4]. In addition, a few unpublished genomes were analyzed: Streptococcus pyogenes (obtained from the Oklahoma University Genome Center website [7]), Streptococcus pneumoniae (H. Tettelin, personal communication), and Mycobacterium leprae (obtained from the Sanger Centre Pathogen Sequencing Group website [9]).

Whole-genome DNA alignments

DNA alignments of the complete genomic sequences of all bacteria used in this study were accomplished with the MUMmer program [6]. This program uses an efficient suffix tree construction algorithm to rapidly compute alignments of entire genomes. The algorithm identifies all exact matches of nucleotide subsequences that are contained in both input sequences; these exact matches must be longer than a specified minimum length, which was set to 20 base pairs for this comparison. To search for genome-scale alignments within species, complete bacterial and archaeal genomes (25 in total including all published genomes) were aligned with their own reverse complements. To search for between-species alignments, all genomes were aligned against all others in both orientations.
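To make the idea of exact-match anchoring concrete, here is a deliberately simplified sketch. It is not MUMmer's suffix-tree algorithm and it does not extend matches to their maximal length; it only reports fixed-length substrings (k-mers) that occur exactly once in each of two sequences. All function names are hypothetical, and the default `k=20` simply mirrors the minimum match length quoted above.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start index."""
    seen = defaultdict(list)
    for i in range(len(seq) - k + 1):
        seen[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in seen.items() if len(pos) == 1}


def unique_exact_matches(seq_a, seq_b, k=20):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both inputs."""
    a = unique_kmer_positions(seq_a, k)
    b = unique_kmer_positions(seq_b, k)
    return sorted((a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
```

A within-species comparison in this simplified setting would call `unique_exact_matches(genome, reverse_complement(genome))`, and a between-species comparison would be run once against each orientation of the second genome. The dictionary-based approach uses far more memory than a suffix tree and is meant only for small test sequences.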

Whole-genome protein comparisons

The predicted proteome of each complete genome sequence (all predicted proteins in the genome) was compared to the proteomes of all complete genome sequences (including itself) using the fasta3 program [50]. Matches with an expected score (e-value) of 10⁻⁵ or less were considered significant.

Statistical significance of X-alignments

To calculate the statistical significance of the X-alignments, the maximal unique matching subsequences (MUMs) for unrelated genomes were examined and found to be uniformly distributed [6]. With a uniform background, the expected density of MUMs in any region of an alignment plot is a simple proportion of the area of that region to the entire plot. In particular, in an alignment with N total MUMs, the probability (Pr) of observing at least m matches in a region that covers a fraction p of the plot area can be computed using the binomial distribution in Equation 1:

$$\Pr(X \ge m) = \sum_{k=m}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k} \qquad (1)$$

The alignment of V. cholerae chrI (both forward and reverse strands) versus E. coli contains 926 MUMs. The MUMs forming X-alignments appear along the diagonal (y = x) and the anti-diagonal (y = L - x, where L is the genome length). To estimate the significance of the alignments in both directions, diagonal strips were sampled along each of the diagonals. The width of each strip was set at 10% of the plot area and significance values were calculated (Table 1).
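The binomial tail probability described in this section can be evaluated numerically. The sketch below is one way to do so, working in log space so that the binomial coefficients do not overflow for N in the hundreds. The function names are hypothetical, and the count of 200 in the usage note is an invented figure for illustration, not a value from Table 1.

```python
from math import exp, lgamma, log


def log_binomial_pmf(n, k, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_coefficient = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coefficient + k * log(p) + (n - k) * log(1.0 - p)


def binomial_tail(n, m, p):
    """P(X >= m): chance of at least m of n MUMs landing in a region
    that covers a fraction p of the plot, under a uniform background."""
    if m <= 0:
        return 1.0
    if m > n:
        return 0.0
    return sum(exp(log_binomial_pmf(n, k, p)) for k in range(m, n + 1))
```

For example, `binomial_tail(926, 200, 0.10)` gives the probability of finding 200 or more of the 926 MUMs in a strip covering 10% of the plot if MUMs were scattered uniformly; about 93 would be expected by chance.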

Identification of origins of replication

The origins of replication for the bacterial genomes have been characterized by a variety of methods. For E. coli, M. tuberculosis and M. leprae, the origins have been well-characterized by laboratory studies [51,52]. The origins and termini of C. trachomatis, C. pneumoniae and V. cholerae were identified by GC-skew [53] and by characteristic genes in the region of the origin [4,19]. GC-skew uses the function (G-C)/(G+C) computed on 2,000 bp windows across the genome, which exhibits a clear tendency in many bacterial genomes to be positive for the leading strand and negative for the lagging strand. The origin of H. pylori was determined by oligomer skew [54] and confirmed by GC-skew. The origins and termini of S. pneumoniae and S. pyogenes were determined by the authors of the present study using GC-skew analysis and the locations of characteristic genes, particularly the chromosome replication initiator gene dnaA.
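The GC-skew calculation described above is simple enough to sketch directly. The version below computes (G-C)/(G+C) on consecutive, non-overlapping windows and reports where the sign flips; whether the original analysis used overlapping or non-overlapping windows is not stated, so that choice is an assumption, as are the function names.

```python
def gc_skew(seq, window=2000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C carries no skew information.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def sign_changes(skews, window=2000):
    """Approximate coordinates where the skew switches sign.

    These are candidate origin or terminus locations; real data are
    noisy, so isolated flips should be checked against other evidence.
    """
    return [
        i * window
        for i in range(1, len(skews))
        if skews[i - 1] * skews[i] < 0
    ]
```

The sketch treats the sequence as linear, so a sign change that falls at the wrap-around point of a circular chromosome would not be reported. In practice a candidate origin found this way would be checked against marker genes such as dnaA, as was done here.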

Acknowledgements

We thank S. Eddy, M.A. Riley, T. Read, A. Stoltzfus, M-I Benito and I. Paulsen for helpful comments, suggestions and discussions. S.L.S. was supported in part by NSF grant IIS-9902923 and NIH grant R01 LM06845. S.L.S. and J.A.E. were supported in part by NSF grant KDI-9980088. Data for all published complete genome sequences were obtained from the NCBI genomes database [26] or from The Institute for Genomic Research (TIGR) Microbial Genome Database [27]. The sequences of V. cholerae, S. pneumoniae, and M. tuberculosis (CDC 1551) were determined at TIGR with support from NIH and the NIAID. The M. leprae sequence data were produced by the Pathogen Sequencing Group at the Sanger Centre. Sequencing of M. leprae is funded by the Heiser Program for Research in Leprosy and Tuberculosis of The New York Community Trust and by L’Association Raoul Follereau. The M. tuberculosis CDC 1551 genome sequence was obtained from TIGR. The source of the S. pyogenes genome sequence was the Streptococcal Genome Sequencing Project funded by USPHS/NIH grant AI38406, and was kindly made available by B. A. Roe, S.P. Linn, L. Song, X. Yuan, S. Clifton, R.E. McLaughlin, M. McShan and J. Ferretti, and can be obtained from the website of the Oklahoma University Genome Center [7].

References

1. Seoighe C, Wolfe KH: Updated map of duplicated regions in the yeast genome. Gene 1999, 238:253-261.

2. Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M, et al.: Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 1999, 402:761-768.

3. Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N, et al.: Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 1999, 402:769-777.

4. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, et al.: The genome sequence of Vibrio cholerae, the aetiologic agent of cholera. Nature 2000, 406:477-483.

5. Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al.: The complete genome sequence of Escherichia coli K-12. Science 1997, 277:1453-1462.

6. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Res 1999, 27:2369-2376.

7. Oklahoma University Genome Center [http://www.genome.ou.edu/strep.html]

8. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE III, et al.: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998, 393:537-544.

9. Sanger Centre Pathogen Sequencing Group [ftp://ftp.sanger.ac.uk/pub/pathogens/leprae]

10. Zipkas D, Riley M: Proposal concerning mechanism of evolution of the genome of Escherichia coli. Proc Natl Acad Sci USA 1975, 72:1354-1358.

11. Wagner A: The fate of duplicated genes: loss or new function? BioEssays 1998, 20:785-788.

12. Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154:459-473.

13. Nadeau JH, Sankoff D: Comparable rates of gene loss and functional divergence after genome duplications early in vertebrate evolution. Genetics 1997, 147:1259-1266.

14. Francois V, Louarn J, Patte J, Rebollo JE, Louarn JM: Constraints in chromosomal inversions in Escherichia coli are not explained by replication pausing at inverted terminator-like sequences. Mol Microbiol 1990, 4:537-542.

15. Rebollo JE, Francois V, Louarn JM: Detection and possible role of two large nondivisible zones on the Escherichia coli chromosome. Proc Natl Acad Sci USA 1988, 85:9391-9395.

16. Segall A, Mahan MJ, Roth JR: Rearrangement of the bacterial chromosome: forbidden inversions. Science 1988, 241:1314-1318.

17. Mahan MJ, Roth JR: Ability of a bacterial chromosome segment to invert is dictated by included material rather than flanking sequence. Genetics 1991, 129:1021-1032.

18. Segall AM, Roth JR: Recombination between homologies in direct and inverse orientation in the chromosome of Salmonella: intervals which are nonpermissive for inversion formation. Genetics 1989, 122:737-747.

19. Read TD, Brunham RC, Shen C, Gill SR, Heidelberg JF, White O, Hickey EK, Peterson J, Utterback T, Berry K, et al.: Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res 2000, 28:1397-1406.

20. Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JA, Ketchum KA, Hood DW, Peden JF, Dodson RJ, et al.: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 2000, 287:1809-1815.

21. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151:1531-1545.

22. Schmid MB, Roth JR: Selection and endpoint distribution of bacterial inversion mutations. Genetics 1983, 105:539-557.

23. Mahan MJ, Roth JR: Reciprocality of recombination events that rearrange the chromosome. Genetics 1988, 120:23-35.

24. Gordon AJ, Halliday JA: Inversions with deletions and duplications. Genetics 1995, 140:411-414.

25. Valencia-Morales E, Romero D: Recombination enhancement by replication (RER) in Rhizobium etli. Genetics 2000, 154:971-983.

26. National Center for Biotechnology Information, Entrez Genomes [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome]

27. The Institute for Genomic Research Microbial Genome Database [http://www.tigr.org/tdb/mdb/mdb.html]

28. Kawarabayasi Y, Hino Y, Horikawa H, Yamazaki S, Haikawa Y, Jin-no K, Takahashi M, Sekine M, Baba S, Ankai A, et al.: Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res 1999, 6:83-101.

29. Deckert G, Warren PV, Gaasterland T, Young WG, Lenox AL, Graham DE, Overbeek R, Snead MA, Keller M, Aujay M, et al.: The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 1998, 392:353-358.

30. Klenk H-P, Clayton RA, Tomb J-F, White O, Nelson KE, Ketchum KA, Dodson RJ, Gwinn M, Hickey EK, Peterson JD, et al.: The complete genomic sequence of the hyperthermophilic, sulfate-reducing archaeon Archaeoglobus fulgidus. Nature 1997, 390:364-370.

31. Kunst F, Ogasawara N, Moszer I, Albertini A, Alloni G, Azevedo V, Bertero M, Bessieres P, Bolotin A, Borchert S, et al.: The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 1997, 390:249-256.

32. Fraser CM, Norris SJ, Weinstock GM, White O, Sutton GG, Dodson R, Gwinn M, Hickey EK, Clayton R, Ketchum KA, et al.: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 1997, 390:580-586.

33. Parkhill J, Wren BW, Mungall K, Ketley JM, Churcher C, Basham D, Chillingworth T, Davies RM, Feltwell T, Holroyd S, et al.: The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 2000, 403:665-668.

34. Kalman S, Mitchell W, Marathe R, Lammel C, Fan J, Hyman RW, Olinger L, Grimwood J, Davis RW, Stephens RS: Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat Genet 1999, 21:385-389.

35. Stephens RS, Kalman S, Lammel C, Fan J, Marathe R, Aravind L, Mitchell W, Olinger L, Tatusov RL, Zhao Q, et al.: Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 1998, 282:754-759.

36. White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, et al.: Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 1999, 286:1571-1577.

37. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269:496-512.

38. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty BA, et al.: The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 1997, 388:539-547.

39. Alm RA, Ling LS, Moir DT, King BL, Brown ED, Doig PC, Smith DR, Noonan B, Guild BC, deJonge BL, et al.: Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 1999, 397:176-180.

40. Smith DR, Doucette-Stamm LA, Deloughery C, Lee H, Dubois J, Aldredge T, Bashirzadeh R, Blakely D, Cook R, Gilbert K, et al.: Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics. J Bacteriol 1997, 179:7135-7155.

41. Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, Fitzgerald LM, Clayton RA, Gocayne JD, et al.: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273:1058-1073.

42. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270:397-403.

43. Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li BC, Herrmann R: Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res 1996, 24:4420-4449.

44. Parkhill J, Achtman M, James KD, Bentley SD, Churcher C, Klee SR, Morelli G, Basham D, Brown D, Chillingworth T, et al.: Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature 2000, 404:502-506.

45. Kawarabayasi Y, Sawada M, Horikawa H, Haikawa Y, Hino Y, Yamamoto S, Sekine M, Baba S, Kosugi H, Hosoyama A, et al.: Complete sequence and gene organization of the genome of a hyperthermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res 1998, 5:55-76.

46. Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG: The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 1998, 396:133-140.

47. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.: Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 1996, 3:109-136.

48. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, et al.: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399:323-329.

49. Fraser CM, Norris SJ, Weinstock GM, White O, Sutton GG, Dodson R, Gwinn M, Hickey EK, Clayton R, Ketchum KA, et al.: Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 1998, 281:375-388.

50. Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132:185-219.

51. Marsh RC, Worcel A: A DNA fragment containing the origin of replication of the Escherichia coli chromosome. Proc Natl Acad Sci USA 1977, 74:2720-2724.

52. Salazar L, Fsihi H, de Rossi E, Riccardi G, Rios C, Cole ST, Takiff HE: Organization of the origins of replication of the chromosomes of Mycobacterium smegmatis, Mycobacterium leprae and Mycobacterium tuberculosis and isolation of a functional origin from M. smegmatis. Mol Microbiol 1996, 20:283-293.

53. Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol 1996, 13:660-665.

54. Salzberg SL, Salzberg AJ, Kerlavage AR, Tomb JF: Skewed oligomers and origins of replication. Gene 1998, 217:57-67.

Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 5(3): e82

I am posting here my recent paper that just came out in PLoS Biology on Environmental Shotgun Sequencing. With PLoS’s Creative Commons license I am allowed to do this, which makes me happy. The citation is Eisen JA (2007) Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 5(3): e82 doi:10.1371/journal.pbio.0050082

Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes

Jonathan A. Eisen

Citation: Eisen JA (2007) Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 5(3): e82 doi:10.1371/journal.pbio.0050082

Published: March 13, 2007

Copyright: © 2007 Jonathan A. Eisen. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: ESS, environmental shotgun sequencing; PCR, polymerase chain reaction; rRNA, ribosomal RNA

Jonathan A. Eisen is at the University of California Davis Genome Center, with joint appointments in the Section of Evolution and Ecology and the Department of Medical Microbiology and Immunology, Davis, California, United States of America. Web site: http://phylogenomics.blogspot.com. E-mail: jaeisen@ucdavis.edu

Series Editor: Simon Levin, Princeton University, United States of America

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.


Since their discovery in the 1670s by Anton van Leeuwenhoek, an incredible amount has been learned about microorganisms and their importance to human health, agriculture, industry, ecosystem functioning, global biogeochemical cycles, and the origin and evolution of life. Nevertheless, it is what is not known that is most astonishing. For example, though there are certainly at least 10 million species of bacteria, only a few thousand have been formally described [1]. This contrasts with the more than 350,000 described species of beetles [2]. This is one of many examples indicative of the general difficulties encountered in studying organisms that we cannot readily see or collect in large samples for future analyses. It is thus not surprising that most major advances in microbiology can be traced to methodological advances rather than scientific discoveries per se.

Examples of these key revolutionary methods (Table 1) include the use of microscopes to view microbial cells, the growth of single types of organisms in the lab in isolation from other types (culturing), the comparison of ribosomal RNA (rRNA) genes to construct the first tree of life that included microbes [3], the use of the polymerase chain reaction (PCR) [4] to clone rRNA genes from organisms without culturing them [5–7], and the use of high-throughput “shotgun” methods to sequence the genomes of cultured species [8]. We are now in the midst of another such revolution—this one driven by the use of genome sequencing methods to study microbes directly in their natural habitats, an approach known as metagenomics, environmental genomics, or community genomics [9].

Table 1. Some Major Methods for Studying Individual Microbes Found in the Environment

In this essay I focus on one particularly promising area of metagenomics—the use of shotgun genome methods to sequence random fragments of DNA from microbes in an environmental sample. The randomness and breadth of this environmental shotgun sequencing (ESS)—first used only a few years ago [10,11] and now being used to assay every microbial system imaginable from the human gut [12] to waste water sludge [13]—has the potential to reveal novel and fundamental insights into the hidden world of microbes and their impact on our world. However, the complexity of analysis required to realize this potential poses unique interdisciplinary challenges, challenges that make the approach both fascinating and frustrating in equal measure.

Who Is Out There? Typing and Counting Microbes in the Environment

One of the most important and conceptually straightforward steps in studying any ecosystem involves cataloging the types of organisms and the numbers of each type. For a long time, such typing and counting was an almost insurmountable problem in microbiology. This is largely because physical appearance does not provide a valid taxonomic picture in microbes. Appearance evolves so rapidly that two closely related taxa may look wildly different and two distantly related taxa may look the same. This vexing problem was partially overcome in the 1980s through the use of rRNA-PCR (Table 1). This method allows microorganisms in a sample to be phylogenetically typed and counted based on the sequence of their rRNA genes, genes that are present in all cell-based organisms. In essence, a database of rRNA sequences [14,15] from known organisms functions like a bird field guide, and finding an rRNA-PCR product is akin to seeing a bird through binoculars. Rather than counting species, this approach focuses on “phylotypes,” which are defined as organisms whose rRNA sequences are very similar to each other (a cutoff of >97% or >99% identical is frequently used). The ability to use phylotyping to determine who was out there in any microbial sample has revolutionized environmental microbiology [16], led to many discoveries [e.g., 17], and convinced many people (myself included) to become microbiologists.
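As a rough illustration of the phylotype idea, the sketch below groups sequences whose identity exceeds a cutoff. It is a toy under stated assumptions, not the procedure used in any of the cited studies: the sequences are assumed to be already aligned to equal length, the grouping is a simple greedy pass against the first member of each group, and the function names and example sequences are invented for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def group_into_phylotypes(seqs: dict, cutoff: float = 97.0) -> list:
    """Greedy grouping: a sequence joins the first phylotype whose
    representative (its first member) it matches above the cutoff;
    otherwise it starts a new phylotype."""
    phylotypes = []
    for name, seq in seqs.items():
        for members in phylotypes:
            if percent_identity(seq, seqs[members[0]]) > cutoff:
                members.append(name)
                break
        else:
            phylotypes.append([name])
    return phylotypes


# Toy 100-base "rRNA" sequences (hypothetical data, not real genes).
known = "ACGT" * 25
toy = {
    "known": known,
    "close": "T" + known[1:],          # 1 mismatch   -> 99% identical
    "distant": "T" * 10 + known[10:],  # 8 mismatches -> 92% identical
}
phylotypes = group_into_phylotypes(toy)
print(phylotypes)
```

With the 97% cutoff, "known" and "close" fall into one phylotype and "distant" into another; loosening the cutoff merges them, which is why the choice of cutoff matters when counting phylotypes.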

The selective targeting of a single gene makes rRNA-PCR an efficient method for deep community sampling [18]. However, this efficiency comes with limitations, most of which are complemented or circumvented by the randomness and breadth of ESS. For example, examination of the random samples of rRNA sequences obtained through ESS has already led to the discovery of new taxa—taxa that were completely missed by PCR because of its inability to sample all taxa equally well (e.g., [19]). In addition, ESS provides the first robust sampling of genes other than rRNA, and many of these genes can be more useful for some aspects of typing and counting. Some universal protein coding genes are better than rRNA both for distinguishing closely related strains (because of third position variation in codons) and for estimating numbers of individuals (because they vary less in copy number between species than do rRNA genes) [10]. Perhaps most significantly, ESS is providing groundbreaking insights into the diversity of viruses [20,21], which lack rRNA genes and thus were left out of the previous revolution.
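The copy number point can be made concrete with a little arithmetic. The sketch below is only an illustration: the taxon names, gene counts, and copy numbers are hypothetical, and the function is not taken from any cited tool. It shows how counting a gene that varies in copy number can distort estimates of how many individuals are present.

```python
def copy_number_corrected_abundance(gene_counts: dict, copies_per_genome: dict) -> dict:
    """Convert observed gene counts into relative abundances of individuals
    by dividing each count by that taxon's gene copies per genome."""
    cells = {taxon: count / copies_per_genome[taxon]
             for taxon, count in gene_counts.items()}
    total = sum(cells.values())
    return {taxon: n / total for taxon, n in cells.items()}


# Hypothetical sample: taxon_A carries 7 rRNA copies per genome, taxon_B carries 1.
observed_rrna = {"taxon_A": 70, "taxon_B": 10}
rrna_copies = {"taxon_A": 7, "taxon_B": 1}

# Naive estimate: treat every observed gene as one individual.
total_genes = sum(observed_rrna.values())
naive = {taxon: count / total_genes for taxon, count in observed_rrna.items()}

corrected = copy_number_corrected_abundance(observed_rrna, rrna_copies)
print(naive)
print(corrected)
```

In this toy case the naive count suggests taxon_A makes up 87.5% of the community, while the corrected estimate is an even split. A gene present in roughly one copy per genome would need little or no such correction.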

Certainly, many challenges remain before we can fully realize the potential of ESS for the typing and counting of species, including making automated yet accurate phylogenetic trees of every gene, determining which genes are most useful for which taxa, combining data from different genes even when we do not know if they come from the same organisms, building up databases of genes other than rRNA, and making up for the lack of depth of sampling. If these challenges are met, ESS has the potential to rewrite much of what we thought we knew about the phylogenetic diversity of microbial life.

What Are They Doing? Top Down and Bottom Up Approaches to Understanding Functions in Communities

A community is, of course, more than a list of types of organisms. One approach to understanding the properties and functioning of a microbial community is to start with studies of the different types of organisms and build up from these individuals to the community. Ideally, to do this one would culture each of the phylotypes and study its properties in the lab. Unfortunately, many, if not most, key microbes have not yet been cultured [22]. Thus, for many years, the only alternative was to make predictions about the biology of particular phylotypes based on what was known about related organisms. Unfortunately, this too does not work well for microbes since very closely related organisms frequently have major biological differences. For example, Escherichia coli K12 and E. coli O157:H7 are strains of the same species (and considered to be the same phylotype), with genomes containing only about 4,000 genes, yet each possesses hundreds of functionally important genes not seen in the other strain [23]. Such differences are routine in microbes, and thus one cannot make any useful inferences about what particular phylotypes are doing (e.g., type of metabolism, growth properties, role in nutrient cycling, or pathogenicity) based on the activities of their relatives.

These difficulties—the inability to culture most microbes and the functional disparities between close relatives—led to one of the first kinds of metagenomic analyses, wherein predictions of function were made from analysis of the sequence of large DNA fragments from representatives of known phylotypes. This approach has provided some stunning insights, such as the discovery of a novel form of phototrophy in the oceans [24]. However, this large insert approach has the same limitation as predicting properties from characterized relatives—a single cell cannot possibly represent the biological functions of all members of a phylotype.

ESS provides an alternative, more global way of assessing biological functions in microbial communities. As when using the large insert approach, functions can be predicted from sequences. However, in this case the predicted functions represent a random sampling of those encoded in the genomes of all the organisms present. This approach has unquestionably been wildly successful in terms of gene discovery. For example, analysis of ESS data has revealed novel forms of every type of gene family examined, as well as a great number of completely novel families (e.g., [25]). However, there is a major caveat when using ESS data to make community-level inferences. Ecosystems are more than just a bag of genes—they are made up of compartments (e.g., cells, chromosomes, and species), and these compartments matter. The key challenge in analyzing ESS data is to sort the DNA fragments (which are usually less than 1,000 base pairs long relative to genome sizes of millions or billions of bases) into bins that correspond to compartments in the system being studied.

A recent study by myself and colleagues illustrates the importance of compartments when interpreting ESS data. When we analyzed ESS data from symbionts living inside the gut of the glassy-winged sharpshooter (an insect that has a nutrient-limited diet), we were able to bin the data to two distinct symbionts [26]. We then could infer from those data that one of the symbionts synthesizes amino acids for the host while the other synthesizes the needed vitamins and cofactors. Modeling and understanding of this ecosystem are greatly enhanced by the demonstration of this complementary division of labor, in comparison to simply knowing that amino acids, vitamins, and cofactors are made by “symbionts.”

How does one go about binning ESS data? A variety of approaches have been developed, some of which are described in Table 2. In considering the different binning methods and their limitations, the first question one needs to ask is, what are we trying to bin? Is it fragments from the same chromosome from a single cell, which would be useful for studying chromosome structure? If so, then perhaps genome assembly methods are the best. What if instead, as in the sharpshooter example, we are trying to have each bin include every fragment that came from a particular species, knowledge which may be useful for predicting community metabolic potential? If the level of genetic polymorphism among individual cells from the same species is high, then genome assembly methods may not work well (the polymorphisms will break up assemblies). A better approach might be to look for species-specific “word” frequencies in the DNA, such as ones created by patterns in codon usage. The challenge is, how do we tune the methods to find the right target level of resolution? If we are too stringent, most bins will include only a few fragments. But if we are too relaxed, we will create artificial constructs that may prove biologically misleading, such as grouping together sequences from different species. To make matters more complex, most likely the stringency needed will vary for different taxa present in the sample.
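The stringency trade-off in word-frequency binning can be sketched in a few lines. This is a minimal toy under stated assumptions, not any of the published binning methods: it uses two-letter words and very short invented fragments (real analyses would likely use longer words and much longer fragments), a plain Euclidean distance between frequency profiles, and a greedy pass in which `max_distance` is a hypothetical parameter standing in for stringency.

```python
import math
from collections import Counter
from itertools import product


def word_profile(seq: str, k: int = 2) -> list:
    """Relative frequency of every DNA word of length k in seq."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("fragment has no complete A/C/G/T words")
    return [counts[w] / total for w in words]


def profile_distance(p: list, q: list) -> float:
    """Euclidean distance between two word-frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def bin_fragments(fragments: dict, max_distance: float, k: int = 2) -> list:
    """Greedy binning: a fragment joins the first bin whose first member
    lies within max_distance of it; otherwise it starts a new bin."""
    profiles = {name: word_profile(seq, k) for name, seq in fragments.items()}
    bins = []
    for name in fragments:
        for members in bins:
            if profile_distance(profiles[name], profiles[members[0]]) <= max_distance:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Hypothetical fragments: two AT-rich and two GC-rich.
fragments = {
    "at1": "AT" * 20,
    "gc1": "GC" * 20,
    "at2": "ATTA" * 10,
    "gc2": "GCCG" * 10,
}
print(bin_fragments(fragments, max_distance=0.1))  # too stringent
print(bin_fragments(fragments, max_distance=0.6))  # intermediate
print(bin_fragments(fragments, max_distance=1.5))  # too relaxed
```

At the stringent setting every fragment ends up alone in its own bin, at the relaxed setting all four are lumped together, and only the intermediate setting separates the AT-rich from the GC-rich fragments. In a real sample the appropriate setting would probably differ between taxa, as noted above.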

Table 2. Methods of Binning

Another critical issue is the diversity of the system under study. Generally, binning works better when there are few different phylotypes present, all of which are distantly related and form discrete populations. This is why binning works well for the sharpshooter system and other relatively isolated, low diversity environments. Binning increases in difficulty exponentially as the number of species increases: the populations and species start to merge together, and the populations get more and more polymorphic and variable in relative abundance (such as in the paper about the Global Ocean Sampling expedition in this issue [27]). Further complicating binning is the phenomenon of lateral gene transfer, where genes are exchanged between distantly related lineages at rates that are high enough that random sampling of a genome will frequently include genes with multiple histories.

Despite these challenges, I believe we can develop effective binning methods for complex communities. First, we can combine different approaches together, such as using one method to sort in a relaxed manner and then using another to subdivide the bins provided by the first method. Second, we can incorporate new approaches such as population genetics into the analysis [28]. In addition, the lessons learned here can be applied to other aspects of metagenomics (e.g., the counting and typing discussed above) and provide insights into the nature of microbial genomes and the structure of microbial populations and communities.

Comparative Metagenomics

So far, I have discussed issues relating mostly to intrasample analysis of ESS data. However, the area with perhaps the most promise involves the comparative analysis of different samples. This work parallels the comparative analysis of genomes of cultured species. Initial studies of that type compared distantly related taxa with enormous biological differences. What has been learned from these studies pertains mostly to core housekeeping functions, such as translation and DNA metabolism, and to other very ancient processes [29,30]. It was not until comparisons were made between closely related organisms that we began to understand events that occurred on shorter time scales, such as selection, gene transfer, and mutation processes [31]. Similarly, the initial comparisons of ESS data involved comparisons of wildly different environments [32], yielding insights into the general structure of communities. But as more comparisons are made between similar communities [33,34], such as those sampled during vertical and horizontal ocean transects [27,35–37], we will begin to learn about shorter time scale processes such as migration, speciation, extinction, responses to disturbance, and succession. It is from a combination of both approaches—comparing both similar and very divergent communities—that we will be able to understand the fundamental rules of microbial ecology and how they relate to ecological principles seen in macro-organisms.

Conclusions

In promoting some of the exciting opportunities with ESS, I do not want to give the impression that it is flawless. It is helpful in this respect to compare ESS to the Internet. As with the Internet, ESS is a global portal for looking at what occurs in a previously hidden world. Making sense of it requires one to sort through massive, random, fragmented collections of bits of information. Such searches need to be done with caution because any time you analyze such a large amount of data, patterns can be found. In addition, as with the Internet, there is certainly some hype associated with ESS that gives relatively trivial findings more attention than they deserve. Overall, though, I believe the hype is deserved. As long as we treat ESS as a strong complement to existing methods, and we build the tools and databases necessary for people to use the information, it will live up to its revolutionary potential.

Acknowledgments

I thank Simon Levin, Joshua Weitz, Jonathan Dushoff, Maria-Inés Benito, Doug Rusch, Aaron Halpern, and Shibu Yooseph for helpful discussions, and Melinda Simmons, Merry Youle, and three anonymous reviewers for helpful comments on the manuscript. The writing of this paper was supported by National Science Foundation Assembling the Tree of Life Grant 0228651 to Jonathan A. Eisen and by the Defense Advanced Research Projects Agency under grants HR0011-05-1-0057 and FA9550-06-1-0478.

References

  1. Gould SJ (1996) Full house: The spread of excellence from Plato to Darwin. New York: Harmony Books. 244 p.
  2. Evans AV, Bellamy CL (1996) An inordinate fondness for beetles. New York: Holt. 208 p.
  3. Woese C, Fox G (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc Natl Acad Sci U S A 74: 5088–5090.
  4. Mullis K, Faloona F (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155: 335–350.
  5. Reysenbach AL, Giver LJ, Wickham GS, Pace NR (1992) Differential amplification of rRNA genes by polymerase chain reaction. Appl Environ Microbiol 58: 3417–3418.
  6. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The characterization of enzymatically amplified eukaryotic 16S-like ribosomal RNA-coding regions. Gene 71: 491–500.
  7. Weisburg W, Barns S, Pelletier D, Lane D (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697–703.
  8. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
  9. Handelsman J (2004) Metagenomics: Application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–685.
  10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
  11. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
  12. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312: 1355–1359.
  13. Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24: 1263–1269.
  14. Olsen GJ, Larsen N, Woese CR (1991) The ribosomal RNA database project. Nucleic Acids Res 19: 2017–2021.
  15. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, et al. (2007) The ribosomal database project (RDP-II): Introducing myRDP space and quality controlled public data. Nucleic Acids Res 35: D169–D172.
  16. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
  17. Hugenholtz P, Pitulle C, Hershberger KL, Pace NR (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.
  18. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.
  19. Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, et al. (2006) Lineages of acidophilic archaea revealed by community genomic analysis. Science 314: 1933–1935.
  20. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368 doi:10.1371/journal.pbio.0040368.
  21. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.
  22. Leadbetter JR (2003) Cultivation of recalcitrant microbes: Cells are alive, well and revealing their secrets in the 21st century laboratory. Curr Opin Microbiol 6: 274–281.
  23. Perna NT, Plunkett G 3rd, Burland V, Mau B, Glasner JD, et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533.
  24. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
  25. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16 doi:10.1371/journal.pbio.0050016.
  26. Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, et al. (2006) Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters. PLoS Biol 4: e188 doi:10.1371/journal.pbio.0040188.
  27. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77 doi:10.1371/journal.pbio.0050077.
  28. Johnson PL, Slatkin M (2006) Inference of population genetic parameters in metagenomics: A clean look at messy data. Genome Res 16: 1320–1327.
  29. Koonin EV, Mushegian AR (1996) Complete genome sequences of cellular life forms: Glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 6: 757–762.
  30. Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93: 10268–10273.
  31. Eisen JA (2001) Gastrogenomics. Nature 409: 463–465 465–466.
  32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
  33. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, et al. (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7: 57.
  34. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.
  35. DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3: 459–469.
  36. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
  37. Worden AZ, Cuvelier ML, Bartlett DH (2006) In-depth analyses of marine microbial community genomics. Trends Microbiol 14: 331–336.