Convoluted title, cool paper in #PLoSGenetics on relative of insect mutualists causing a human infection

Saw this tweet a few minutes ago:

The title of the paper took me a reread or two to understand.  But once I got what they were trying to say I was intrigued.  And so I went to the paper:  PLOS Genetics: A Novel Human-Infection-Derived Bacterium Provides Insights into the Evolutionary Origins of Mutualistic Insect–Bacterial Symbioses.  And it is loaded with interesting tidbits.  First, the first section of the results details the history of the infection in a 71 year old male and his recovery and the isolation and characterization of a new bacterial strain.  Phylogenetic analysis revealed this was a close relative of the Sodalis endosymbionts of insects.

And then comparative genomics revealed a bit more detail about the history of this strain, it’s relatives, and some of the insect endosymbionts.  And plus, it allowed the authors to make some jazzy figures such as

And this and other comparative analyses revealed some interesting findings.  As summarize by the authors

Our results indicate that ancestral relatives of strain HS have served as progenitors for the independent descent of Sodalis-allied endosymbionts found in several insect hosts. Comparative analyses indicate that the gene inventories of the insect endosymbionts were independently derived from a common ancestral template through a combination of irreversible degenerative changes. Our results provide compelling support for the notion that mutualists evolve from pathogenic progenitors. They also elucidate the role of degenerative evolutionary processes in shaping the gene inventories of symbiotic bacteria at a very early stage in these mutualistic associations.

The paper is definitely worth a look.

SMBE Satellite Meeting on Mechanisms of Protein Evolution II

This meeting might be of interest for people in the lab:

We are pleased to announce the SMBE Satellite Meeting on Mechanisms of
Protein Evolution II: Thermodynamics, Phylogenetics, and Structure
(MPEII 2013), to take place at the University of Colorado Denver’s
Anschutz Medical Campus, February 7-9, 2013.

The meeting aims to broadly cover the interface of protein evolutionary
mechanisms, models of amino acid substitution, genomics/systems biology
and phylogenetics. Topics also include adaptation, coevolution,
convergence, neutral processes including mutation, prediction of
folding, prediction of mutational effects, the influence of
protein-protein interactions on protein evolution, and the interaction
of next-gen sequencing and model development. This is a small meeting,
with plenty of opportunity for interaction. Talks by students as well as
more senior scientists are encouraged, and there will be a poster
session this year in addition to talks. This meeting is also partially
sponsored by BMC Evolutionary Biology and the UC Denver Department of
Biochemistry & Molecular Genetics, Program in Computational Bioscience,
and Consortium for Comparative Genomics.

Confirmed invited speakers include:
Belinda Chang, University of Toronto
Andy Clark, Cornell University
Richard Goldstein, National Institute of Medical Research (UK)
Nicolas Lartillot, University of Montreal
David Liberles University of Wyoming
Michael Lynch, Indiana University
James McInerney, National University of Ireland, Maynooth
Mary O’Connell, Dublin City University
David Pollock, University of Colorado School of Medicine
Jeff Thorne, North Carolina State University
Naomi Ward, University of Wyoming

More information and registration can be found at The early registration deadline is
December 15, 2012. A ski trip at Copper Mountain (CO) is being planned
for attendees in the day(s) that follow the meeting. We hope you can
join us in Denver for this event.

David Pollock, James McInerney, and David Liberles

David Liberles <>

Adaptation and Convergence in Regulatory Systems, guest post by David Pollock

Below is another in my series on “The Story Behind the Paper” with guest posts from authors of open access papers of relevance to this blog.  Today’s post comes from David Pollock in Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine.  Note – I went to graduate school with David and he is a long time friend.  This is why he apparently feels okay to call me Jon even though I go by Jonathan.  I have modified his text only to format it.  Plus I added some figures from his paper.

Adaptation and Convergence in Regulatory Systems

Guest post by David Pollock


This is a guest blog post about a paper of mine just published in Genome Biology and Evolution, “SP transcription factor paralogs and DNA bindingsites coevolve and adaptively converge in mammals and birds“, and it amounts to my first attempt at an un-press release. You can tell it’s not a press release because I’m writing this about a week after the paper came out. This guest blog is coming about because I released an announcement about the paper to a few friends, and credited my old buddy Jon Eisen with having inspired me to move towards avoiding press releases. And I sure wouldn’t want this paper to be named in the same breath as the now infamous arsenic bacteria and Encode project press releases (although I note that a recent paper from my lab estimates that the human genome is at least 2/3 derived from repetitive sequence); also see posts by Michael Eisen and by Larry Moran. I had the vague idea that maybe Jon, Ed Yong, Carl Zimmer, or some other famous blogger would be so inspired by the move (and the fundamental coolness of the paper) that they would quickly write up a thoughtful interpretive summary for general audiences. Jon, however, being much to smart for my own good, suggested that I should write it up as a guest blog instead. So, here I am. At least I got Jon to agree to edit it.

The fundamentals of the paper

Okay then, why do I think adaptation and convergence in regulatory systems is cool and important? Well, first because I think a lot of important changes in evolution have to have come about through regulatory evolution, and yet there are huge gaps in our knowledge of how this change might happen. And I say this as someone who has spent most of his career studying (and still believing in the importance of) sequence evolution. Second, a lot of people seem to think that evolution of whole regulatory systems should be hard, because there are so many interactions that would need to change at the same time. Remember, transcription factors can interact with hundreds of different binding sites to regulate hundreds of different proteins. It makes sense that evolution of such a system should be hard. In this paper, I think we go a long way towards demonstrating that this intuitive sense is wrong, that functional evolution of regulatory systems can happen quite easily, that it has happened in a system with around a thousand functional binding sites, and that some of the details of how it happens are really interesting.

Fig. 1. Evolution of SP transcription factors. (A) SP1 binds preferentially to the GC box in placental mammals and birds (red) and to the ancestral GA box consensus in other vertebrates (black). Modifications in binding motif preferences along the phylogeny are denoted by red-filled circles. ‘Variable regions’ in zinc finger 2 (zf2-VR), containing all non-conserved sites in zinc finger 2 within vertebrates, are shown for SP1, SP3, and SP4. Site –13 (highlighted) is putatively responsible for the change in SP1 binding preferences. (B) Zinc finger 2 (zf2) of human SP1, SP3, and SP4. Each zinc finger contains an alpha-helix and two beta sheets (Philipsen et al. 1999; Dhanasekaran et al. 2006). Red and gray columns denote sites non- conserved across vertebrates; all are contained in the boxed variable region (zf2-VR), comprising sites -13 to -8. Site +3 binds directly to the convergent A/C fourth site of the GC box. (C) SP1 binds to the DNA via zinc fingers 1-3 (zf1-zf3), where zf2 binds to the three central nucleotides of the GC box (GGGCGG) (Philipsen et al. 1999; Bouwman et al. 2002; Dhanasekaran et al. 2006). Site -13 (red) is only 9.5 Å from site +3 (green) and directly contacts the neighboring site (site +4) (Bouwman et al. 2002; Oka et al. 2004; Dhanasekaran et al. 2006).

Aside from promoting the science, the other reason I want to blog about this paper is that I think it is a great demonstration of how fun and how diverse science can be. To support its points, it brings in many different types of evidence, from genomics to population genetics to protein structure prediction. It is also a good example of using only publicly available data, plus a lot of novel analysis, to see something interesting that was just sitting there. I think there must be a lot more stories like that. All of that Encode data, for example, is bound to have some interesting undiscovered stories, even with 30 papers already published. The most fun part, though, which I don’t think I can fully recreate without jumping up and down in front of you, was just the thrill of discovery, and the thrill of having so many predictions fall into place with data from so many different sources.

I don’t recall another project where we would say so many times, “well, if that is the explanation, then let’s look at this other thing,” and bam!, we look at the other thing and it fits in too. It started with Ken Yokoyama, the first author, walking into the lab having just published some pretty good evidence that the preference for the SP1-associated binding site (the GC-box) was newly evolved in the ancestor to eutherian mammals. Well, if that’s true, there ought to be a change in the SP1 protein sequence that can explain it. Sure enough, there is, and SP1 is a very conserved protein that doesn’t change a lot. Hmm, we have more sequences now, let’s look to see if preferences changed anywhere else on the phylogenetic tree. Yes, in birds. Well, there ought to be a change in bird SP1 that can explain that; sure enough, there is, and it’s at the homologous position in the protein. Looking good, but is it in the right place in the protein? Yes, in the right domain (zinc finger 2, or zf2), right behind the alpha helix that binds the nucleotide for which the preference changed. And before you ask, Ken ran a protein structure prediction algorithm on the amino acid replacements in SP1, and the predicted functional replacements in bird and mammal are predicted to bend the protein right at the point where it binds the nucleotide at which the preference changed. You might then ask if this amino acid replacement does anything to the binding function, and the answer again is “yes”. This time, though, we were able to rely on existing functional studies, which showed that human SP1 binds 3x better to the GC-box binding site than it does to the ancestral GA-box (more on this below).

Fig. 2. Birth-death rates of the SP1 binding motif in mammals. Birth rates (α) denote the probability (per year) that an unoccupied position will gain a binding site; death rates (β) give the probability (per year) that an existing binding site is lost. Branches in the mammalian phylogeny were partitioned into three groups: early eutherian mammals (red), late eutherian mammals (black), and GA box-preferring non- eutherian mammals (blue). Birth and death rates of each group were estimated for the GC box (GGGCGG), GA box (GGGAGG), and the non-functional motif GGGTGG (Letovsky et al. 1989; Wierstra 2008).

The coup de grace on this residue position as the source of convergent functional changes, though, came with consideration of the other transcription factors that interact in this regulatory system (that is, they bind to the same binding sites to modify transcription). If there was a functional change in the transcription factor, driving modified changes in the binding sites, then it seems that this should affect the other transcription factors in the system. It could have been hard to figure anything out about these other transcription factors, but luckily they consist of SP3 and SP4, two paralogs of SP1. This means that they are ancient duplicates of ancestral SP proteins, they share a great deal of conserved sequence with SP1, and they bind with similar affinities to the SP1 binding consensus. And they have not just one or two, but between the two proteins, in birds and mammals, at least eight convergent amino acid replacements at the homologous position that putatively modified binding in SP1. And the substitution that occurs is the same replacement that occurred in bird SP1. Based on sequences from jawed fish and frogs, this position was almost completely conserved in the SP3 and SP4 paralogs for 360 or 450 million years of evolution. The convergent changes all occurred in only the last 100 million years or less of eutherian and bird lineages. We believe that the simplest interpretation is that, over tens of millions of years, a functional replacement occurred at the SP1 protein, adaptively driving hundreds of SP1 binding sites to convert from ancestral GA-boxes to derived GC-boxes, and that this then drove the same functional replacement in coregulatory paralogs SP3 and SP4.

Timing and a mechanism

Two questions often comes up at this point, “how do we know the order of these events?” The simplest piece of evidence for the order of events comes from the order of fixation of substitutions. The amino acid substitutions in SP1 are fixed in all eutherian mammals and all birds, indicating that they occurred on the branches leading to these taxon groups. The increase in GC-boxes occurred over time at different loci, mostly on the branch leading to eutherian mammals and on the branches immediately after that split the most ancient eutherian mammal groups. The replacements in SP3 and SP4 occurred later in the evolution of eutherian mammals and birds, and did not occur in all lineages. One might be able to come up with complicated scenarios whereby changes in some SP1 binding sites occurred first, driving the fixation of the SP1 replacement, followed by further selected changes in other SP1 binding sites, but we think our hypothesis is simpler.

Fig. 3. Population frequencies of an adaptive mutant transcription factor and its binding sites.  (NOTE – SOME DETAIL OF LEGEND LOST IN COPY/PASTE – SEE PAPER). (A) Shown are the population frequencies of the adaptive mutant transcription factor allele (blue), which first occurs in a single heterozygous individual at generation (population size:   ). The total population frequency of the novel binding consensus (BOXC) and the initial wild-type binding motif (BOXA) are shown in red and black, respectively. We assume a small adaptive benefit for the adaptive transcription factor SPC binding to BOXC (relative fitness , where ) over the wild-type transcription factor and its motif (relative fitness ). Maladaptive binding events (SPC binding to BOXA or the wild-type transcription factor binding to BOXC) have reduced fitness ( , where ). Population frequencies of SPC, BOXA, and BOXC are given on the left for the first 20,000 generations and on the right for 150,000 generations. (B) Evolution of the adaptive trans-factor and binding sites under a semi-dominant model. SPC binding to BOXC is assigned relative fitness for individuals heterozygous for the transcription factor genotype (     ) and for individuals homozygous for the mutant transcription factor. (C) The single binding site locus model. In contrast to the previous model, each locus is restricted to no more than one binding motif (either BOXA or BOXC).

Other pieces of evidence also come into play. The question about the order of events can be rephrased as a question of whether neutral forces, such as changes in mutation rates at binding sites, could have altered the frequency of the alternative binding sites, with SP1 (and then SP3 and SP4) playing functional catch-up to better match the new binding site frequencies. It seems to us that such a model would predict that the binding sites would have changed irrespective to the function of the proteins that they regulate. (As an aside here, we note that our binding site data set is best described not as a definitive set of SP1-regulated promoters, but as a set that is highly enriched in functional SP1 binding sites. We don’t trust binding site function predictions, and the putative binding sites inclusively considered were those that had either the ancestral GA-box or the derived GC-box in the functionally relevant region prior to the transcription start site. Such sites are highly enriched for categories of genes known to be under SP1 control.) But the binding sites that shift from GA-boxes to GC-boxes are even further enriched for categories of genes under SP1 control. This is not compatible with the neutral mutational shift model, but is compatible with the idea that the subset of our sites for which SP1 regulation is most important are the ones that were most likely to adaptively shift box type when the SP1 with altered function became more frequent.

The mutational driver model also predicts a simple shift in frequencies driven by mutation. For example, GA-boxes might tend to mutate into GC-boxes, and conversely, GC-boxes might tend to be conserved and not mutate to GA-boxes. What I haven’t told you yet, though, is that the excess GC-boxes do not tend to be produced by mutation from GA-boxes, but rather they tend to be produced as de novo mutations from non-SP1 box sequences. They are produced by a wide variety of mutations from a wide variety of different sequences that are slightly different from the canonical SP1 binding sites. Furthermore, the GC-boxes appear in a burst of birth early in eutherian evolution, but the GA-boxes don’t disappear in a burst at the same time. Rather, they simply fade away slowly over time in lineages that have evolved GC-boxes. It is not clear to us that this can be explained using a mutation model, but it is easily explained by a model in which the SP1 replacement has adaptively driven hundreds of binding site convergent events. This is then followed by the slow mutational degradation of the GA-boxes, which don’t matter so much to function anymore. It is also worth mentioning that the GC box preference doesn’t seem to correlate with GC content, as several fish lineages are just as high in GC content as humans, but do not have the GC-box preference.

Fig. 4. Structural changes of SP1 zinc finger 2 (zf2) following replacements at site -13. (Top) Comparisons of predicted lowest-energy zf2 structures between the native human peptide (-13M), and peptides following replacements to the ancestral valine (M-13V) and bird isoleucine (M-13I) at site -13. Structural alignments were conducted according to residues on the 5’ end of the peptide (residues -16 to -12). Both -13M and M-13I peptides showed displacement of residues 5’ to the DNA-contacting alpha-helix (sites -6 to -1) compared to the ancestral valine peptide. No such displacement was seen between -13M and M-13I. All three peptides aligned closely at the 3’ end of the alpha-helix (sites +6 to +10), reflecting structural modifications at the 5’ end of the alpha-helix. (Bottom) Distances between alpha carbons prior to and within the alpha-helix (blue and orange, respectively). Comparisons between the native human peptide and M-13V (left) and between M-13I and M-13V (center) show closely-aligned residues at the 3’ end of the alpha-helix and increasing displacement towards the 5’ end. These modifications begin around site +3, which directly contacts the A/C evolving site of the SP1 binding motif (Philipsen et al. 1999; Bouwman et al. 2002; Dhanasekaran et al. 2006). No such region- specific displacement between -13M and M-13I was observed between -13M and M-13I (right).

The observed pattern of binding site evolution is also predicted by a model we developed to determine if the evolution of transcription factors and their binding sites could be explained in a population genetics framework. We asked, what is a possible mechanism by which these changes might occur? At the beginning of this post, I noted that a lot of people seem to think that evolution of complex multi-genic regulatory systems should be hard. We reasoned, though, that if the beneficial effects of a newly evolved binding interaction were dominant or semi-dominant (that is, the beneficial effects in the heterozygote were at least partly visible to selection), then it might be possible for evolution to be achieved through a transition period in which both the transcription factor and its cognate binding sites were polymorphic.

We developed both a deterministic and a stochastic model, and found that, indeed, even small (nearly neutral) selective benefits per locus can drive the entire system to fixation. What happens is that as long as there are some binding loci with the new binding box in the system, then a new variant of SP1 with a preference for the new binding box, and an associated small selective benefit, will at first rapidly increase in frequency. It won’t immediately fix though, but rather will maintain a temporary steady state, kept down in frequency by the deleterious effects that new variant homozygotes would have with the large number of binding loci that are homozygous for the ancestral binding box. Once it has reached this steady state, it exerts selective pressure on all the binding loci to increase the frequencies of the new binding boxes at each locus. Much more slowly then, the frequency of the new transcription factor variant increases, in step with the frequencies of new binding boxes at all the binding loci.

Although our studies do not prove that our population genetics model is the exact mechanism for adaptive changes in SP1, it provides proof of concept that it is not difficult for such a mechanism to exist.

Where does this leave us?

At the broader level, this paper shows that small selective benefits can drive the evolution of complex regulatory systems (in diploids, at least; sorry to leave out the micro folks, Jon). Furthermore, it demonstrates, we believe convincingly, that adaptation has driven the evolution of the SP1 regulatory system, driving convergent evolution at many hundreds of promoters, and in SP3 and SP4. It thus strongly counters prevailing notions that such evolution is hard. We hope that this work (along with other work of this kind) will drive others to further pursue the broad questions in regulatory evolution. Are the details of the SP1 system common to other regulatory systems?

A particularly important question, which we did not focus on here, is whether the evolution we have described involves only static maintenance of the status quo in terms of which genes are regulated. One has to wonder, though, whether if it is easy to evolve a static regulatory system, that it is not therefore easier than previously believed to modify regulatory connections in a complex regulatory system. There are hints of such changes here, in that genes that may have gained novel SP1 regulation (that is, gained a GC-box when they did not have the ancestral GA-box) tend to be enriched in certain GO categories (see Table 2 in the paper).

For SP1, it will be interesting to see if good stories can be developed for to explain why this adaptation should have occurred specifically in birds and eutherian mammals. The ideal story should include both a biophysical mechanism, and a physiology-based mechanism, such as the possibility that warm-bloodedness played a role. Both of these avenues promise to be complicated, if addressed properly. For example, we believe that it will be more meaningful if a biophysical mechanism can address the need for specificity as well as strength of binding, perhaps by utilizing next-generation sequencing approaches to measure affinities for all relevant binding site mutations (see, among others, our recent paper on this topic, Pollock et al., 2011). Are there interactive roles for selection on transcription factor concentration as well as efficiency and selectivity? What trade-offs exist among binding efficiency and binding site duplication? Do different types of regulatory connections evolve differently? These are all great questions for future research.


I’ll try to add further comments if questions or issues come up. I’m particularly interested to see how this non-press release guest log post works as an experiment to promote the paper and the work. I also hope it will promote Ken Yokoyama’s career (he’s now at Illinois, and will probably be looking for an academic job in the next year or two). He did an awesomely diverse amount of work on this, learning how to work in totally new areas for him, such as population genetics, birth/death models, and protein structure prediction. He dove into these areas unhesitatingly to pursue the logical scientific questions, developed novel analyses, and did a great job. This paper represents a fundamental contribution and a fantastic advertisement for Ken’s abilities.

Some quick comments on "Giant viruses coexisted w/ the cellular ancestors & represent a distinct supergroup"

Got asked on Twitter about this paper:

BMC Evolutionary Biology | Abstract | Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya

// I answered briefly

// Don’t have time for a detailed blog post but here are some quick comments:

1. Giant viruses are fascinating and cool

2. I have done work connected to the topic of this paper and thus might not be considered fully objective.  For example see

3. I see no evidence that the type of analysis that they do on protein folds is a robust phylogenetic method.  Phylogeny from sequence alignments (which is what we focus on in my lab) have been tested and tweaked for some 50 years.  There are 100s to maybe 1000s of papers on methods alone – not to mention the 1000s of papers using alignments for phylogenetics.  I am not convinced that the analysis being done here of FFs and FSFs is particularly robust.  It seems interesting, certainly.  But is it sound?  I mean, I could build phylogenetic trees from cell size, from shape, from eye color, and from all sorts of other features.  Those would all suck for certain.  Protein folds – not sure about them.  They almost certainly are prone to convergent evolution and I do not see any attempt in this analysis to deal with that issue.
4. The authors of the current paper do not show any taxa names on their trees – just colors for large groups of taxa (bacteria, archaea, eukaryotes and viruses).  It is really not good practice to remove the taxon names.  If they were there the first thing I would do is to look at the patterns within the groups they highlight.  Do all the major phyla / kingdoms of eukaryotes, for example, come out looking as one would expect based upon other studies.  Or are they all over the place?  Same for bacteria and archaea.  Not including taxa makes it nearly impossible to judge this paper positively.  I could not find this information in supplemental data either.
5. They really should have released the data tables they used for the phylogenetic analysis.  Don’t know why they did not.
6. In Figure 3 with the rooting they have, either viruses are a subgroup of archaea or archaea are not monophyletic.  Not a good thing in a paper trying to claim viruses represent a fourth grouping on the tree of life.
Anyway – got to do some other things but just wanted to get some comments out there.

UPDATE 9/19 – some prior stories about the “fourth domain” and ancient viruses – to counter notion in the press release for this paper that their findings “shake up the tree of life”.  Even if their specific inferences about viral evolution are correct, such inferences / conclusions have been made before.

A blast from the past: Plasmodium, plastids, phylogeny, and reproducibility

A few days ago I got an email from a colleague who I had not seen in many years.  It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.

His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum the causative agent of most human malaria cases –  for which he was the lead author.   Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper.   The paper is freely available at Nature – at least for now – every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.

Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper.  I note – as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome.  I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.

Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted.  And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need.  It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.  

As I was communicating with this person, I started digging through my files and my brain trying to remember exactly what had been done for this paper more than 10 years ago.  I remember Malcolm and others from the Plasmodium community organizing some “jamborees” looking at the annotation of the genome. At one of those jamborees I met with some of the folks from the Sanger Center (which was one of the big players in the P. falciparum genome sequencing) with Malcolm and – after some discussion I ended up doing three main things relating to the paper, which I describe below.

Thing 1: Conserved eukaryote genes

One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea.  I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.

As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome.  In the N&V I had written “Genome sequencing: Brouhaha over the other yeast” I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes.  With the help of the Nature staff I had also made a figure which demonstrated (sort of) what they were trying to do in their analysis – which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time.  As another aside – the S. pombe genome paper and my News and Views article are freely available …

Figure 1: The tree of life, with the branches labelled according to Wood et al.’s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000

Anyway, I did a similar analysis to what was in the S. pombe genome paper and I found a reasonable number and helped write a section for the paper on this.

Comparative genome analysis with other eukaryotes for which the complete genome is available (excluding the parasite E. cuniculi) revealed that, in terms of overall genome content, P. falciparum is slightly more similar to Arabidopsis thaliana than to other taxa. Although this is consistent with phylogenetic studies (64), it could also be due to the presence in the P. falciparum nuclear genome of genes derived from plastids or from the nuclear genome of the secondary endosymbiont. Thus the apparent affinity of Plasmodium and Arabidopsis might not reflect the true phylogenetic history of the P. falciparum lineage. Comparative genomic analysis was also used to identify genes apparently duplicated in the P. falciparum lineage since it split from the lineages represented by the other completed genomes (Supplementary Table B). 

There are 237 P. falciparum proteins with strong matches to proteins in all completed eukaryotic genomes but no matches to proteins, even at low stringency, in any complete prokaryotic proteome (Supplementary Table C). These proteins help to define the differences between eukaryotes and prokaryotes. Proteins in this list include those with roles in cytoskeleton construction and maintenance, chromatin packaging and modification, cell cycle regulation, intracellular signalling, transcription, translation, replication, and many proteins of unknown function. This list overlaps with, but is somewhat larger than, the list generated by an analysis of the S. pombe genome (65). The differences are probably due in part to the different stringencies used to identify the presence or absence of homologues in the two studies.

The list of genes is available as supplemental material on the Nature web site.  Alas it is in MS Word format which is not the most useful thing.  But more on that issue at the end of this post.

Thing 2. Searching for lineage specific duplications

Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available).  The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome.  The list of genes we identified this way is also provided as a Word document in Supplemental materials.

Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum

The third thing I did for the paper was to search for organelle derived genes in the nuclear genome of Plasmodium.  Specifically I was looking for genes derived from the mitochondrial genome and plastid genome.  For those who do not know, Plasmodium is a member of the Apicomplexa – all organisms in this group have an unusual organelle called the Apicoplast.  Though the exact nature of this organelle had been debated, it’s evolutionary origins were determined by none other than Malcolm Gardner many years earlier (Gardner et al. 1994). They had shown that this organelle was in fact derived from chloroplasts (which themselves are derived from cyanoabcteria).  I am shamed to say that before hanging out with Malcolm and talking about Plasmodium I did not know this.  This finding of a chloroplast in an evolutionary group of eukaryotes that are not particularly closely related to plants is one of the key pieces of evidence in the “secondary endosymbiosis” hypothesis which proposes that some eukaryotes have brought into themselves as an endosymbiont a single-celled photosynthetic algae which had a chloroplast.  
Anyway – here we were – with the first full genome of a member of the Apicomplexans group.  And we could use it to discover some new details on plastid evolution and secondary endosymbioses.  So I adapted some methods I had used in analyzing the Arabidopsis genome (see Lin et al. 1999 and AGI 2000), and searched for plastid derived genes in the nuclear genome of Plasmodium.  Why look in the nuclear genome for plastid genes?  Or mitochondrial genes for that matter.  Well, it turns out that genes that were once in the organelle genomes frequently move to the nuclear genome of their “host”.  In fact, a lot of genes move.  So – if you want to study the evolution of an organism’s organelles, it is sometimes more fruitful to look in the nuclear genome than in the actual organelle’s genome.  OK – now back to the Plasmodium genome.  What I was doing was trying to find genes in the nuclear that had once been in the plastid genome.  How would you look for these?  
To find mitochondrial-derived genes I did blast searches against the same database of genomes used to study the evolution of eukaryotes but for this I looked for genes in Plasmodium that has decent matches to genes in alpha proteobacteria.  And for those I then build phylogenetic trees of each gene and its homologs, then screened through all the trees to look for any in which the gene from Plasmodium grouped in a tree inside a clade with sequences from alpha proteobacteria (and allowed for mitochondrial genes from other eukaryotes to be in this clade).  
To find plastid derived genes I did a similar screen except instead searched for genes that grouped in evolutionary trees with genes from cyanobacteria (or eukaryotic genes that were from plastids).  The section of the paper that I helped write is below:

A large number of nuclear-encoded genes in most eukaryotic species trace their evolutionary origins to genes from organelles that have been transferred to the nucleus during the course of eukaryotic evolution. Similarity searches against other complete genomes were used to identify P. falciparum nuclear-encoded genes that may be derived from organellar genomes. Because similarity searches are not an ideal method for inferring evolutionary relatedness (66), phylogenetic analysis was used to gain a more accurate picture of the evolutionary history of these genes. Out of 200 candidates examined, 60 genes were identified as being of probable mitochondrial origin. The proteins encoded by these genes include many with known or expected mitochondrial functions (for example, the tricarboxylic acid (TCA) cycle, protein translation, oxidative damage protection, the synthesis of haem, ubiquinone and pyrimidines), as well as proteins of unknown function. Out of 300 candidates examined, 30 were identified as being of probable plastid origin, including genes with predicted roles in transcription and translation, protein cleavage and degradation, the synthesis of isoprenoids and fatty acids, and those encoding four subunits of the pyruvate dehydrogenase complex. The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content. Nevertheless, it appears likely that the total number of plastid-derived genes in P. falciparum will be significantly lower than that in the plant A. thaliana (estimated to be over 1,000). Phylogenetic analysis reveals that, as with the A. thaliana plastid, many of the genes predicted to be targeted to the apicoplast are apparently not of plastid origin. Of 333 putative apicoplast-targeted genes for which trees were constructed, only 26 could be assigned a probable plastid origin. In contrast, 35 were assigned a probable mitochondrial origin and another 85 might be of mitochondrial origin but are probably not of plastid origin (they group with eukaryotes that have not had plastids in their history, such as humans and fungi, but the relationship to mitochondrial ancestors is not clear). The apparent non-plastid origin of these genes could either be due to inaccuracies in the targeting predictions or to the co-option of genes derived from the mitochondria or the nucleus to function in the plastid, as has been shown to occur in some plant species (67).

Thing 4: Analysis of DNA repair genes 

Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:

DNA repair processes are involved in maintenance of genomic integrity in response to DNA damaging agents such as irradiation, chemicals and oxygen radicals, as well as errors in DNA metabolism such as misincorporation during DNA replication. The P. falciparum genome encodes at least some components of the major DNA repair processes that have been found in other eukaryotes (111, 112). The core of eukaryotic nucleotide excision repair is present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1) although some highly conserved proteins with more accessory roles could not be found (for example, XPA/Rad4, XPC). The same is true for homologous recombinational repair with core proteins such as MRE11, DMC1, Rad50 and Rad51 present but accessory proteins such as NBS1 and XRS2 not yet found. These accessory proteins tend to be poorly conserved and have not been found outside of animals or yeast, respectively, and thus may be either absent or difficult to identify in P. falciparum. However, it is interesting that Archaea possess many of the core proteins but not the accessory proteins for these repair processes, suggesting that many of the accessory eukaryotic repair proteins evolved after P. falciparum diverged from other eukaryotes. 

The presence of MutL and MutS homologues including possible orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that P. falciparum can perform post-replication mismatch repair. Orthologues of MSH4 and MSH5, which are involved in meiotic crossing over in other eukaryotes, are apparently absent in P. falciparum. The repair of at least some damaged bases may be performed by the combined action of the four base excision repair glycosylase homologues and one of the apurinic/apyrimidinic (AP) endonucleases (homologues of Xth and Nfo are present). Experimental evidence suggests that this is done by the long-patch pathway (113). 

The presence of a class II photolyase homologue is intriguing, because it is not clear whether P. falciparum is exposed to significant amounts of ultraviolet irradiation during its life cycle. It is possible that this protein functions as a blue-light receptor instead of a photolyase, as do members of this gene family in some organisms such as humans. Perhaps most interesting is the apparent absence of homologues of any of the genes encoding enzymes known to be involved in non-homologous end joining (NHEJ) in eukaryotes (for example, Ku70, Ku86, Ligase IV and XRCC1)(112). NHEJ is involved in the repair of double strand breaks induced by irradiation and chemicals in other eukaryotes (such as yeast and humans), and is also involved in a few cellular processes that create double strand breaks (for example, VDJ recombination in the immune system in humans). The role of NHEJ in repairing radiation-induced double strand breaks varies between species (114). For example, in humans, cells with defects in NHEJ are highly sensitive to -irradiation while yeast mutants are not. Double strand breaks in yeast are repaired primarily by homologous recombination. As NHEJ is involved in regulating telomere stability in other organisms, its apparent absence in P. falciparum may explain some of the unusual properties of the telomeres in this species (115).

Back to the story
Anyway … back to the story.  I do not have current access to all of TIGR’s old computer systems which is where my searches for the genome paper reside.  But I figured I might have some notes somewhere on my computer about what blast parameters I used for these searches.  And amazingly I did.  As I was getting ready to write back to Malcolm and to the person who has asked for the information I decided to double check to see what was in the paper.  And amazingly, much of the detail was right there all along.   

Plasmodium falciparum proteins were searched against a database of proteins from all complete genomes as well as from a set of organelle, plasmid and viral genomes. Putative recently duplicated genes were identified as those encoding proteins with better BLASTP matches (based on E value with a 10-15 cutoff) to other proteins in P. falciparum than to proteins in any other species. Proteins of possible organellar descent were identified as those for which one of the top six prokaryotic matches (based on E value) was to either a protein encoded by an organelle genome or by a species related to the organelle ancestors (members of the Rickettsia subgroup of the -Proteobacteria or cyanobacteria). Because BLAST matches are not an ideal method of inferring evolutionary history, phylogenetic analysis was conducted for all these proteins. For phylogenetic analysis, all homologues of each protein were identified by BLASTP searches of complete genomes and of a non-redundant protein database. Sequences were aligned using CLUSTALW, and phylogenetic trees were inferred using the neighbour-joining algorithms of CLUSTALW and PHYLIP. For comparative analysis of eukaryotes, the proteomes of all eukaryotes for which complete genomes are available (except the highly reduced E. cuniculi) were searched against each other. The proportion of proteins in each eukaryotic species that had a BLASTP match in each of the other eukaryotic species was determined, and used to infer a ‘whole-genome tree’ using the neighbour-joining algorithm. Possible eukaryotic conserved and specific proteins were identified as those with matches to all the complete eukaryotic genomes (10-30 E-value cutoff) but without matches to any complete prokaryotic genome (10-15 cutoff).

Alas, I cannot for the life of me find what other parameters I used for the blastp searches.  I am 99.9999% sure I used default settings but alas, I don’t know what default settings for blast were in that era.  And I am not even sure which version of blastp was installed on the TIGR computer systems then.  I certainly need to do a better job of making sure everything I do is truly reproducible.


This all brings me to the actual real part of this story.  Reproducibility.  It is a big deal.  Anyone should be able to reproduce what was done in a study.  And alas, it is difficult to do that when not all the methods are fully described.  And one should also provide intermediate results so that people to do not have to redo everything you did in a study but can just reproduce part of it.   It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium.  Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material.  It is probably still lounging around somewhere on the JCVI computer systems (TIGR alas, no longer officially exists … it was swallowed by the J. Craig Venter Institute …).  But I will keep digging and I will post them to some place like FigShare if/when I find them.

Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.

The key lesson – plan in advance for how you are going to share results, methods, data, etc …

Profile of Michael Turelli in the Sacramento Bee

Pretty good profile of Michael Turelli in the Sacramento Bee: UCD professor Michael Turelli finds biomathematics work ‘ridiculously satisfying’ – Living Here – The Sacramento Bee.  It discusses his career from PhD work to early research to his new work on Wolbachia.  Note of lack of objectivity on my part – Turelli was the first person to recruit me to UC Davis and, well, I love him.  He simply is great …

Silly microbiologist, genomes are for mutualists

New paper in PLoS Genetics of possible interest: PLoS Genetics: Population Genomics of the Facultatively Mutualistic Bacteria Sinorhizobium meliloti and S. medica.

The abstract does an OK job with the technical details:

The symbiosis between rhizobial bacteria and legume plants has served as a model for investigating the genetics of nitrogen fixation and the evolution of facultative mutualism. We used deep sequence coverage (>100×) to characterize genomic diversity at the nucleotide level among 12 Sinorhizobium medicae and 32 S. meliloti strains. Although these species are closely related and share host plants, based on the ratio of shared polymorphisms to fixed differences we found that horizontal gene transfer (HGT) between these species was confined almost exclusively to plasmid genes. Three multi-genic regions that show the strongest evidence of HGT harbor genes directly involved in establishing or maintaining the mutualism with host plants. In both species, nucleotide diversity is 1.5–2.5 times greater on the plasmids than chromosomes. Interestingly, nucleotide diversity in S. meliloti but not S. medicae is highly structured along the chromosome – with mean diversity (θπ) on one half of the chromosome five times greater than mean diversity on the other half. Based on the ratio of plasmid to chromosome diversity, this appears to be due to severely reduced diversity on the chromosome half with less diversity, which is consistent with extensive hitchhiking along with a selective sweep. Frequency-spectrum based tests identified 82 genes with a signature of adaptive evolution in one species or another but none of the genes were identified in both species. Based upon available functional information, several genes identified as targets of selection are likely to alter the symbiosis with the host plant, making them attractive targets for further functional characterization.

I think the author summary is a bit more, well, friendly:

Facultative mutualisms are relationships between two species that can live independently, but derive benefits when living together with their mutualistic partners. The facultative mutualism between rhizobial bacteria and legume plants contributes approximately half of all biologically fixed nitrogen, an essential plant nutrient, and is an important source of nitrogen to both natural and agricultural ecosystems. We resequenced the genomes of 44 strains of two closely related species of the genus Sinorhizobium that form facultative mutualisms with the model legme Medicago truncatula. These data provide one of the most complete examinations of genomic diversity segregating within microbial species that are not causative agents of human illness. Our analyses reveal that horizontal gene transfer, a common source of new genes in microbial species, disproportionately affects genes with direct roles in the rhizobia-plant symbiosis. Analyses of nucleotide diversity segregating within each species suggests that strong selection, along with genetic hitchhiking has sharply reduced diversity along an entire chromosome half in S. meliloti. Despite the two species’ ecological similarity, we did not find evidence for selection acting on the same genetic targets. In addition to providing insight into the evolutionary history of rhizobial, this study shows the feasibility and potential power of applying population genomic analyses to microbial species.

I have highlighted the section dissing pathogen studies …

As with every good paper, it starts with a tree

Figure 1. Neighbor-joining trees showing relationships among 32 S. meliloti (blue squares) and 12 S. medicae (red circles). 
A) chromosomes, B) pSymA and pSMED02, and C) pSymB and pSMED01. Trees were constructed using sequences from coding regions only. The length of the branch separating S. medicae from S. meliloti strains is shown at a scale that is 5% of the true scale. The 24-strain S. meliloti group is marked by asterisks. All branches had 100% bootstrap support unless otherwise indicated. Branches with <80% bootstrap support were collapsed into polytomies. An identical tree with strain identifications is provided as Figure S2.

The tree lays out the phylogeny of the strains sequenced in this study.  And it provides the main framework for much of the rest of the paper.  
Some comments:
  • The genomes were sequenced to ~ 100x coverage with on an Illumina GAIIx.
  • Reads were then aligned to reference genomes of close relatives of the sequenced strains.
  • These alignments were then used for various comparative and population genetic analyses
  • As far as I can tell no de novo assemblies were done
  • I am quite confused about their methods for detecting putative regions that have undergone horizontal gene transfer:
    • In the methods: “We identified genes likely to have experienced recent horizontal gene transfer by comparing the ratio of polymorphisms that were shared between species to fixed differences between species. Based on the whole-genome distribution of this ratio (Figure S3) we identified putatively transferred genes as those with a ratio of shared polymorphisms to fixed differences >0.2.”
    • Not sure how/why this should work.  Not saying it is a bad idea – I just don’t really understand it.
  • They also examine various population genetic parameters including possible selection, SNPs, Tajima’s D, and more. 
It is worth a read.  They summarize their various findings with:

Population genetic analyses of nucleotide diversity segregating within Sinorhizobium medicae and S. meliloti have provided unprecedented insight into the evolutionary history of these ecologically important facultative symbionts. While previous analyses have detected evidence for horizontal gene transfer between these species, our data reveal that gene transfer is restricted almost exclusively to plasmid genes and that the plasmid regions that show evidence of transfer have less interspecific divergence than other genomic regions. Interestingly, nucleotide variation segregating within a 24-strain subpopulation of S. meliloti is highly structured along the chromosome, with one half of the chromosome harboring approximately one-fifth as much diversity as the other. The causes of the difference between the two chromosome halves may be a selective sweep coupled with extensive hitchhiking, if this is correct it would suggest that bouts of strong selection may be important in driving the divergence of bacterial species. Finally, we’ve identified genes that bear a signature of having evolved in response to recent positive selection. Functional characterization of these genes will provide insight into the selective forces that drive rhizobial adaptation.

Not a #badomics word but – "Evolutionary Systems Biology" is – well – pretty complex

Just saw the title of this article Evolutionary Systems Biology: Historical and Philosophical Perspectives on an Emerging Synthesis by Maureen A. O’Malley in Advances in Experimental Medicine and Biology, 2012, Volume 751, 1-28, DOI: 10.1007/978-1-4614-3567-9_1.  And my first thought was “Hmmm – WTF is Evolutionary Systems Biology“.  Given that I am still unsure what Systems Biology is exactly I figured – this could be a doozy.  Thus I perused the abstract:

According to the abstract

Systems biology (SB) is at least a decade old now and maturing rapidly. A more recent field, evolutionary systems biology (ESB), is in the process of further developing system-level approaches through the expansion of their explanatory and potentially predictive scope. This chapter will outline the varieties of ESB existing today by tracing the diverse roots and fusions that make up this integrative project. My approach is philosophical and historical. As well as examining the recent origins of ESB, I will reflect on its central features and the different clusters of research it comprises. In its broadest interpretation, ESB consists of five overlapping approaches: comparative and correlational ESB; network architecture ESB; network property ESB; population genetics ESB; and finally, standard evolutionary questions answered with SB methods. After outlining each approach with examples, I will examine some strong general claims about ESB, particularly that it can be viewed as the next step toward a fuller modern synthesis of evolutionary biology (EB), and that it is also the way forward for evolutionary and systems medicine. I will conclude with a discussion of whether the emerging field of ESB has the capacity to combine an even broader scope of research aims and efforts than it presently does.

I am not sure what to say here.  The author has published some interesting work previously on philosophical issues in biology.   But from the abstract – well – I am pretty lost.  It seems that ESB covers a lot of ground.  First – systems biology – whatever it means –  itself is pretty broad.  And then on top of that, ESB apparently covers even more than SB.  Still not sure what ESB is — I am torn about whether it could be interesting or completely flaky.  I am (as many know) a big fan of adding evolutionary approaches to just about any area of biology.  So that alone makes me think about reading the paper to see whether there is any there there.  But alas, I do not have access, so I am going to have to move on to something else.

Some comments from the web on this paper

Does phylogeny matter? (In Eco-Evo meta-analyses) … Apparently, yes, but it depends.

As many may know – I am pretty obsessed with the uses of phylogeny in biological studies.  In fact, one could say this has driven almost all of my work.  Thus when an email went around a little bit ago about an article for a journal club at UC Davis where the title begins with “Does phylogeny matter?”, well, I had to take a look.  Alas, I was a bit worried when I saw the article was in Ecology Letters because I am at home and was not sure about access policies for this journal.

But I was pleasantly surprised to get full access without any library – VPN login to the following article: Does phylogeny matter? Assessing the impact of phylogenetic information in ecological meta-analysis – Chamberlain – 2012 – Ecology Letters – Wiley Online Library.  I cannot figure out WHY it is freely available right now, nor how long it will be, but I took the chance to look the article over.

And I was even more pleasantly surprised to look over the article.  Many meta-analyses can seem forced – if not almost unbearable to look through.  But this one is very well done.  Basically they did a massive comparison of conclusions that one could reach when one either does or does not take into account the phylogenetic non-independence of taxa when conducting meta-analyses in evo-eco studies.  They searched published literature for meta-analyses and then .. well I will use their words here (from the end of their introduction):

Herein, we re-analyse datasets from previously published meta-analytic studies, comparing results of traditional and phylogenetic meta-analyses. In addition, we attempt to explain variation in the effect of phylogenetic information on meta-analytic outcomes by examining characteristics of phylogenies. We ask: (1) how does accounting for phylogenetic non-independence change results of individual meta-analyses? and (2) across datasets, what characteristics of phylogenies explain changes in effect size for phylogenetic vs. traditional meta-analyses? As a complement to our main questions, in Appendix A, we also ask (3) how does accounting for phylogenetic non-independence affect model fit of individual meta-analyses? and (4) across datasets, what characteristics of phylogenies explain variation in the relative fit of phylogenetic meta-analyses? Despite the many compelling reasons to incorporate phylogenetic information into meta-analyses that involve multiple species, investigators often use model comparison criteria, such as Akaike’s Information Criterion (AIC) to assess fit of phylogenetic vs. traditional meta-analytic models. We found a clear bias in relation to phylogeny size for one of the two methods currently used to quantify relative model fit (Q-based AIC), thus our findings have important implications for meta-analysts using such model comparisons (see Appendix A for details).

And the key conclusions are

Here, we have shown that incorporating phylogenies influences ecological meta-analysis outcomes, in many cases changing whether the observed effect size differs significantly from zero. We also show that the degree of difference between traditional and phylogenetic meta-analyses depends on key characteristics of phylogenies. Despite this potential complication, we strongly recommend incorporating phylogenetic information into ecological meta-analyses to account for species non-independence.

They also offer up three main recommendations for consideration

To conclude, we outline three recommendations for the use of phylogenetic meta-analyses in ecology and evolutionary biology:

  1. Use phylogenetic meta-analysis, but note that some response metrics are less likely to be affected by phylogenetic methods.
  2. Include as many species as possible.
  3. Be aware that phylogeny shape may influence meta-analytic outcomes. 

Definitely worth a look …

Chamberlain, S., Hovick, S., Dibble, C., Rasmussen, N., Van Allen, B., Maitner, B., Ahern, J., Bell-Dereske, L., Roy, C., Meza-Lopez, M., Carrillo, J., Siemann, E., Lajeunesse, M., & Whitney, K. (2012). Does phylogeny matter? Assessing the impact of phylogenetic information in ecological meta-analysis Ecology Letters, 15 (6), 627-636 DOI: 10.1111/j.1461-0248.2012.01776.x

Guest post: Story Behind the Paper by Joshua Weitz on Neutral Theory of Genome Evolution

I am very pleased to have another in my “Story behind the paper” series of guest posts.  This one is from my friend and colleague Josh Weitz from Georgia Tech regarding a recent paper of his in BMC Genomics.  As I have said before – if you have published an open access paper on a topic related to this blog and want to do a similar type of guest post let me know …

A guest blog by Joshua Weitz, School of Biology and Physics, Georgia Institute of Technology

Summary This is a short, well sort-of-short, story of the making of our paper: “A neutral theory of genome evolution and the frequency distribution of genes” recently published in BMC Genomics. I like the story-behind-the-paper concept because it helps to shed light on what really happens as papers move from ideas to completion. It’s something we talk about in group meetings but it’s nice to contribute an entry in this type of forum.  I am also reminded in writing this blog entry just how long science can take, even when, at least in this case, it was relatively fast. 

The pre-history The story behind this paper began when my former PhD student, Andrey Kislyuk (who is now a Software Engineer at DNAnexus) approached me in October 2009 with a paper by Herve Tettelin and colleagues.  He had read the paper in a class organized by Nicholas Bergman (now at NBACC). The Tettelin paper is a classic, and deservedly so.  It unified discussions of gene variation between genomes of highly similar isolates by estimating the total size of the pan and core genome within multiple sequenced isolates of the pathogen Streptococcus agalactiae.  

However, there was one issue that we felt could be improved: how does one extrapolate the number of genes in a population (the pan genome) and the number of genes that are found in all individuals in the population (the core genome) based on sample data alone?  Species definitions notwithstanding, Andrey felt that estimates depended on details of the alignment process utilized to define when two genes were grouped together.  Hence, he wanted to evaluate the sensitivity of core and pan geonme predictions to changes in alignment rules.  However, it became clear that something deeper was at stake.  We teamed up with Bart Haegeman, who was on an extended visit in my group from his INRIA group in Montpellier, to evaluate whether it was even possible to quantitatively predict pan and core genome sizes. We concluded that pan and core genome size estimates were far more problematic than had been acknowledged.  In fact, we concluded that they depended sensitively on estimating the number of rare genes and rare genomes, respectively.  The basic idea can be encapsulated in this figure:

The top panels show gene frequency distributions for two synthetically generated species.  Species A has a substantially smaller pan genome and a substantially larger core genome than does Species B.  However, when one synthetically generates a sample set of dozens, even hundreds of genomes, then the rare genes and genomes that correspond to differences in pan and core genome size, do not end up changing the sample rarefaction curves (seen at the bottom, where the green and blue symbols overlap).  Hence, extrapolation to the community size will not necessarily be able to accurately estimate the size of the pan and core genome, nor even which is larger!

As an alternative, we proposed a metric we termed “genomic fluidity” which captures the dissimilarity of genomes when comparing their gene composition.

The quantitative value of genomic fluidity of the population can be estimated robustly from the sample.  Moreover, even if the quantitative value depends on gene alignment parameters, its relative order is robust.  All of this work is described in our paper in BMC Genomics from 2011: Genomic fluidity: an integrative view of gene diversity within microbial populations.

However, as we were midway through our genomic fluidity paper, it occurred to us that there was one key element of this story that merited further investigation.  We had termed our metric “genomic fluidity” because it provided information on the degree to which genomes were “fluid“, i.e., comprised of different sets of genes.  The notion of fluidity also implies a dynamic, i.e., a mechanism by which genes move. Hence, I came up with a very minimal proposal for a model that could explain differences in genomic fluidity.  As it turns out, it can explain a lot more.

A null model: getting the basic concepts together
In Spring 2010, I began to explore a minimal, population-genetics style model which incorporated a key feature of genomic assays, that the gene composition of genomes differs substantially, even between taxonomically similar isolates. Hence, I thought it would be worthwhile to  analyze a model in which the total number of individuals in the population was fixed at N, and each individual had exactly M genes.  Bart and I started analyzing this together. My initial proposal was a very simple model that included three components: reproduction, mutation and gene transfer. In a reproduction step, a random individual would be selected, removed and then replaced with one of the remaining N-1 individuals.  Hence, this is exactly analogous to a Moran step in a standard neutral model.  At the time, what we termed mutation was actually representative of an uptake event, in which a random genome was selected, one of its genes was removed, and then replaced with a new gene, not found in any other of the genomes.  Finally, we considered a gene transfer step in which two genomes would be selected at random, and one gene from a given genome would be copied over to the second genome, removing one of the previous genes.  The model, with only birth-death (on left) and mutation (on right), which is what we eventually focused on for this first paper, can be depicted as follows:

We proceeded based on our physics and theoretical ecology backgrounds, by writing down master equations for genomic fluidity as a function of all three events. It is apparent that reproduction decreases genomic fluidity on average, because after a reproduction event, two genomes have exactly the same set of genes.  Likewise, gene transfer (in the original formulation) also decreases genomic fluidity on average, but the decrease is smaller by a factor of 1/M, because only one gene is transferred.  Finally, mutation increases genomic fluidity on average, because a mutation event occurring at a gene which had before occurred in more than one genome, introduces a new singleton gene in the population, hence increasing dissimilarity. The model was simple, based on physical principles, was analytically tractable, at least for average quantities like genomic fluidity, and moreover it had the right tension.  It considered a mechanism for fluidity to increase and two mechanisms for fluidity to decrease.  Hence, we thought this might provide a basis for thinking about how relative rates of birth-death, transfer and uptake might be identified from fluidity.  As it turns out, many combinations of such parameters lead to the same value of fluidity.  This is common in models, and is often referred to as an identifiability problem. However, the model could predict other things, which made it much more interesting.   

The making of the paper
The key moment when the basic model, described above, began to take shape as a paper occurred when we began to think about all the data that we were not including in our initial genomic fluidity analysis.  Most prominently, we were not considering the frequency at which genes occurred amongst different genomes.  In fact, gene frequency distributions had already attracted attention.  A gene frequency distribution summarizes the number of genes that appear in exactly k genomes. The frequency with which a gene appears is generally thought to imply something about its function, e.g., “Comprising the pan-genome are the core complement of genes common to all members of a species and a dispensable or accessory genome that is present in at leastbone but not all members of a species.” (Laing et al., BMC Bioinformatics 2011).  The emphasis is mine. But does one need to invoke selection, either implicitly or explicitly, to explain differences in gene frequency? 

As it turns out, gene frequency distributions end up having a U-shape, such that many genes appear in 1 or a few genomes, many in all genomes (or nearly all), and relatively few occur at intermediate levels.  We had extracted such gene frequency distributions from our limited dataset of ~100 genomes over 6 species.  Here is what they look like:

And, when we began to think more about our model, we realized that the tension that led to different values of genomic fluidity also generated the right sort of tension corresponding to U-shaped gene frequency distributions.  On the one-hand, mutations (e.g., uptake of new genes from the environment) would contribute to shifting the distribution to the left-hand-side of the U-shape.  On the other hand, birth-death would contribute to shifting the distribution to the right-hand side of the U-shape.  Gene transfer between genomes would also shift the distribution to the right. Hence, it seemed that for a given set of rates, it might be possible to generate reasonable fits to empirical data that would generate a U-shape. In doing so, that would mean that the U-shape was not nearly as informative as had been thought.  In fact, the U-shape could be anticipated from a neutral model in which one need not invoke selection. This is an important point as it came back to haunt us in our first round of review.

So, let me be clear: I do think that genes matter to the fitness of an organism and that if you delete/replace certain genes you will find this can have mild to severe to lethal costs (and occasional benefits).  However, our point in developing this model was to try and create a baseline null model, in the spirit of neutral theories of population genetics, that would be able to reproduce as much of the data with as few parameters as possible.  Doing so would then help identify what features of gene compositional variation could be used as a means to identify the signatures of adaptation and selection.  Perhaps this point does not even need to be stated, but obviously not everyone sees it the same way.  In fact, Eugene Koonin has made a similar argument in his nice paper, Are there laws of adaptive evolution: “the null hypothesis is that any observed pattern is first assumed to be the result of non-selective, stochastic processes, and only once this assumption is falsified, should one start to explore adaptive scenarios”.  I really like this quote, even if I don’t always follow this rule (perhaps I should). It’s just so tempting to explore adaptive scenarios first, but it doesn’t make it right.

At that point, we began extending the model in a few directions.  The major innovation was to formally map our model onto the infinitely many alleles model of population genetics, so that we could formally solve our model using the methods of coalescent theory for both cases of finite population sizes and for exponentially growing population sizes.  Bart led the charge on the analytics and here’s an example of the fits from the exponentially growing model (the x-axis is the number of genomes):

At that point, we had a model, solutions, fits to data, and a message.  We solicited a number of pre-reviews from colleagues who helped us improve the presentation (thank you for your help!).  So, we tried to publish it.    

Trying to publish the paper
We tried to publish this paper in two outlets before finding its home in BMC Genomics.  First, we submitted the article to PNAS using their new PNAS Plus format.  We submitted the paper in June 2011 and were rejected with an invitation to resubmit in July 2011. One reviewer liked the paper, apparently a lot: “I very much like the assumption of neutrality, and I think this provocative idea deserves publication.”  The same reviewer gave a number of useful and critical suggestions for improving the manuscript.  Another reviewer had a very strong negative reaction to the paper. Here was the central concern: “I feel that the authors’ conclusion that the processes shaping gene content in bacteria and primarily neutral are significantly false, and potentially confusing to readers who do not appreciate the lack of a good fit between predictions and data, and who do not realise that the U-shaped distributions observed would be expected  under models where it is selection that determines gene number.”  There was no disagreement over the method or the analysis.  The disagreement was one of what our message was.

I still am not sure how this confusion arose, because throughout our first submission and our final published version, we were clear that the point of the manuscript was to show that the U-shape of gene frequency distributions provide less information than might have been thought/expected about selection.  They are relatively easy to fit with a suite of null models.  Again, Koonin’s quote is very apt here, but at some basic level, we had an impasse over a philosophy of the type of science we were doing. Moreover, although it is clear that non-neutral processes are important, I would argue that it is also incorrect to presume that all genes are non-neutral.  There’s lots of evidence that many transferred genes have little to no effect on fitness. We revised the paper, including and solving alternative models with fixed and flexible core genomes, again showing that U-shapes are rather generic in this class of models.  We argued our point, but the editor sided with the negative review, rejecting our paper in November after resubmission in September, with the same split amongst the reviewers. 

Hence, we resubmitted the paper to Genome Biology, which rejected it at the editorial level after a few week delay without much of an explanation, and at that point, we decided to return to BMC Genomics, which we felt had been a good home for our first paper in this area and would likely make a good home for the follow-up.  A colleague once said that there should be an r-index, where r is the number of rejections a paper received before ultimate acceptance.  He argued that r-indices of 0 were likely not good (something about if you don’t fall, then you’re not trying) and an r-index of 10 was probably not good either.  I wonder what’s right or wrong. But I’ll take an r of 2 in this case, especially because I felt that the PNAS review process really helped to make the paper better even if it was ultimately rejected. And, by submitting to Genome Biology, we were able to move quickly to another journal in the same BMC consortia.

Upcoming plans
Bart Haegeman and I continue to work on this problem, from both the theory and bioinformatics side.  I find this problem incredibly fulfilling.  It turns out that there are many features of the model that we still have not fully investigated.  In addition, calculating gene frequency distributions involves a number of algorithmic challenges to scale-up to large datasets.  We are building a platform to help, to some extent, but are looking for collaborators who have specific algorithmic interests in these types of problems.  We are also in discussions with biologists who want to utilize these types of analysis to solve particular problems, e.g., how can the analysis of gene frequency distributions be made more informative with respect to understanding the role of genes in evolution and the importance of genes to fitness.  I realize there are more of such models out there tackling other problems in quantitative population genomics (we cite many of them in our BMC Genomics paper), including some in the same area of understanding the core/pan genome and gene frequency distributions. I look forward to learning from and contributing to these studies.