Convoluted title, cool paper in #PLoSGenetics on relative of insect mutualists causing a human infection

Saw this tweet a few minutes ago:

The title of the paper took me a reread or two to understand.  But once I got what they were trying to say, I was intrigued.  And so I went to the paper:  PLOS Genetics: A Novel Human-Infection-Derived Bacterium Provides Insights into the Evolutionary Origins of Mutualistic Insect–Bacterial Symbioses.  And it is loaded with interesting tidbits.  First, the opening section of the results details the history of the infection in a 71-year-old male, his recovery, and the isolation and characterization of a new bacterial strain.  Phylogenetic analysis revealed this strain to be a close relative of the Sodalis endosymbionts of insects.

And then comparative genomics revealed a bit more detail about the history of this strain, its relatives, and some of the insect endosymbionts.  Plus, it allowed the authors to make some jazzy figures, such as this one:

And this and other comparative analyses revealed some interesting findings.  As summarized by the authors:

Our results indicate that ancestral relatives of strain HS have served as progenitors for the independent descent of Sodalis-allied endosymbionts found in several insect hosts. Comparative analyses indicate that the gene inventories of the insect endosymbionts were independently derived from a common ancestral template through a combination of irreversible degenerative changes. Our results provide compelling support for the notion that mutualists evolve from pathogenic progenitors. They also elucidate the role of degenerative evolutionary processes in shaping the gene inventories of symbiotic bacteria at a very early stage in these mutualistic associations.

The paper is definitely worth a look.

Guest post on "CHANCE" ChIP-seq QC and validation software

Guest post by Aaron Diaz from UCSF on a software package called CHANCE which is for ChIP-seq analyses.  Aaron wrote to me telling me about the software and asking if I would consider writing about it on my blog.  Not really the normal topic of my blog but it is open source and published in an open access journal and is genomicy and bioinformaticy in nature.   So I wrote back inviting him to write about it.  Here is his post:


CHANCE: A comprehensive and easy-to-use graphical software for ChIP-seq quality control and validation



Our recent paper presents CHANCE, user-friendly graphical software for ChIP-seq QC and protocol optimization. CHANCE quickly estimates the strength and quality of immunoprecipitations, identifies biases, compares the user’s data with ENCODE’s large collection of published datasets, performs multi-sample normalization, checks against qPCR-validated control regions, and produces publication-ready graphical reports. CHANCE can be downloaded here.

An overview of ChIP-seq: cross-linked chromatin is sheared, enriched for a transcription factor or epigenetic mark of interest using an antibody, purified, and sequenced.

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a powerful tool for constructing genome-wide maps of epigenetic modifications and transcription factor binding sites. Although this technology enables the study of transcriptional regulation with unprecedented scale and throughput, interpreting the resulting data and knowing when to trust it can be difficult. Also, when things go wrong it is hard to know where to start troubleshooting. CHANCE provides a variety of tests to help debug library preparation protocols.

One of the primary uses of CHANCE is to check the strength of the IP. CHANCE produces a summary statement that estimates the percentage of IP reads mapping to DNA fragments pulled down by the antibody used for the ChIP. In addition to the size of this signal component within the IP, CHANCE reports the fraction of the genome these signal reads cover, as well as the statistical significance of the genome-wide percentage enrichment relative to control, in the form of a q-value (positive false discovery rate). CHANCE has been trained on ChIP-seq experiments from the ENCODE repository by making over 10,000 Input-to-IP and Input-to-replicate-Input comparisons. The reported q-value is the fraction of comparisons between Input-sample technical replicates that show an enrichment of one sample over the other at least as large as that seen in the user’s data. CHANCE also identifies insufficient sequencing depth, PCR amplification bias in library preparation, and batch effects.
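
To make the q-value idea concrete, here is a minimal Python sketch of the underlying logic: score the user's IP-vs-Input comparison with some enrichment statistic, then ask what fraction of Input-vs-Input (null) comparisons score at least as high. This is an illustration only, with a made-up enrichment score and function names; it is not CHANCE's actual code.

```python
import numpy as np

def percent_enrichment(ip_counts, input_counts):
    """Toy enrichment score: after scaling both samples to the same total,
    the fraction of IP read density that exceeds the Input background."""
    ip = np.asarray(ip_counts, dtype=float)
    ctrl = np.asarray(input_counts, dtype=float)
    ip /= ip.sum()
    ctrl /= ctrl.sum()
    return float(np.maximum(ip - ctrl, 0.0).sum())

def empirical_q_value(user_score, null_scores):
    """Fraction of Input-vs-Input (null) comparisons whose enrichment score
    is at least as large as the user's IP-vs-Input score."""
    null_scores = np.asarray(null_scores, dtype=float)
    return float((null_scores >= user_score).mean())

# Hypothetical usage, with counts of reads per genomic bin:
#   user_score  = percent_enrichment(ip_bin_counts, input_bin_counts)
#   null_scores = [percent_enrichment(a, b) for a, b in input_replicate_pairs]
#   q = empirical_q_value(user_score, null_scores)
```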

CHANCE identifies biases in sequence content and quality, as well as cell-type and laboratory-dependent biases in read density. Read-density bias reduces the statistical power to distinguish subtle but real enrichment from background noise. CHANCE visualizes base-call quality and nucleotide frequency with heat maps. Furthermore, efficient techniques borrowed from signal processing uncover biases in read density caused by sonication, chemical digestion, and library preparation.

A typical IP enrichment report.

CHANCE cross-validates enrichment with previous ChIP-qPCR results. Experimentalists frequently use ChIP-qPCR to check the enrichment of positive control regions and the background level of negative control regions in their IP DNA relative to Input DNA. It is thus important to verify whether those select regions originally checked with PCR are captured correctly in the sequencing data. CHANCE’s spot-validation tool provides a fast way to perform this verification. CHANCE also compares enrichment in the user’s experiment with enrichment in a large collection of experiments from public ChIP-seq databases.
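
A toy version of such a spot check might look like the sketch below: for each region previously assayed by ChIP-qPCR, compare the depth-normalized read counts in the IP and Input libraries. The function name, pseudocount, and numbers are all made up for illustration; this is not CHANCE's interface.

```python
def spot_check(ip_region_counts, ip_total, input_region_counts, input_total,
               pseudocount=1.0):
    """For each named region, return the depth-normalized IP/Input read-count
    ratio; positive-control regions should score well above 1, negative
    controls near (or below) 1."""
    ratios = {}
    for name, ip_count in ip_region_counts.items():
        ip_rate = (ip_count + pseudocount) / ip_total
        input_rate = (input_region_counts[name] + pseudocount) / input_total
        ratios[name] = ip_rate / input_rate
    return ratios

# Toy numbers (reads in each region, total mapped reads per library):
print(spot_check({"pos_ctrl": 480, "neg_ctrl": 35}, 20_000_000,
                 {"pos_ctrl": 60, "neg_ctrl": 40}, 18_000_000))
# -> {'pos_ctrl': ~7.1, 'neg_ctrl': ~0.79}
```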

CHANCE has a user-friendly graphical interface.
How CHANCE might be used to provide feedback on protocol optimization.

Quick post: nice #openaccess review: Insights from Genomics into Bacterial Pathogen Populations

Just a quick post here.  There is a new review/commentary that may be of interest: PLOS Pathogens: Insights from Genomics into Bacterial Pathogen Populations.  By Daniel Wilson from the Wellcome Trust Centre at Oxford.

Full citation: Wilson DJ (2012) Insights from Genomics into Bacterial Pathogen Populations. PLoS Pathog 8(9): e1002874. doi:10.1371/journal.ppat.1002874

It is a nice and useful review …

Winner of the "genome conference speakers should be male" award …

Presenters at the World Genome Data Analysis Summit.  Women highlighted in yellow.

  1. Richard LeDuc, Manager, National Center for Genome Analysis Support, Indiana University
  2. Gholson Lyon, Assistant Professor, Cold Spring Harbor Laboratory
  3. Christopher Mason, Assistant Professor, Cornell University
  4. Liz Worthey, Assistant Professor, Medical College of Wisconsin
  5. Garry Nolan, Professor of Genetics, Stanford University
  6. David Dooling, Assistant Director, Genome Institute, Washington University
  7. Peter Robinson, Senior Technical Marketing Manager, DataDirect Networks
  8. Thomas Keane, Senior Scientific Manager, Sequencing Informatics, Wellcome Trust Sanger Institute
  9. Eric Fauman, Associate Research Fellow, Pfizer
  10. Geetha Vasudevan, Assistant Director and Bioinformatics Scientist, Bristol-Myers Squibb
  11. Shanrong Zhao, Senior Scientist, Johnson & Johnson
  12. Bill Barnett, Director, National Center for Genome Analysis Support, Indiana University
  13. Zemin Zhang, Senior Scientist, Bioinformatics, Computational Biology, Genentech
  14. Christopher Mason, Assistant Professor, Cornell University
  15. James Cai, Head, Disease & Translational Informatics, Roche
  16. Eric Zheng, Fellow of Bioinformatics Science, Regeneron
  17. Monica Wang, Associate Director, Knowledge Engineering, Millennium
  18. Joachim Theilhaber, Lead Bioinformatics Research Investigator, Sanofi
  19. Francisco De La Vega, Visiting Scholar, Stanford University
  20. Don Jennings, Manager of Data Integration, Enterprise Information Management, Eli Lilly
  21. Deepak Rajpal, Senior Scientific Investigator, Computational Biology, GSK
  22. Mark Schreiber, Associate Director, Knowledge Engineering, Novartis

So that is a ratio of 19:3 for a whopping 13.6% women.  Please – I beg of you – if you are organizing a conference, give some thought to the diversity of speakers.  In my experience the best conferences have always ended up being ones with highly diverse speakers.  These conferences were probably good because the organizers put a lot of thought into whom to invite to speak, rather than just inviting big names or people they happened to know.

UPDATE: It has been pointed out that I listed one person (Chris Mason) twice — so it is only an 18:3 ratio.  Phew.  Much better.

For other posts on this topic see

"Genomics: the Power and the Promise" meeting – could be called "Men Studying Genomics" instead

Just got another email advertising this meeting: Genomics: the Power and the Promise.  Organized by Genome Canada and the Gairdner Foundation.  And, well, though I love some of the things Genome Canada has done, this conference really sticks in my craw. Why?  It has a serious overabundance of male speakers.  Here is the list of speakers:

Day 1 

  1. Pierre Meulien
  2. John Dirks
  3. Gary Goodyear
  4. Eric Lander
  5. Craig Venter
  6. Philip Sharp
  7. Svante Paabo
  8. Tom Hudson
  9. Peter Jones
  10. Stephen Scherer
  11. Michael Hayden
  12. Bertha Maria Knoppers

Day 2

  1. Stephen Mayfield
  2. Elizabeth Edwards
  3. Curtis Suttle
  4. Peter Langridge
  5. Michel Georges
  6. William Davidson
  7. Klaus Ammann

That is a 17:2 male-to-female ratio: one female speaker per day.  Not impressive.

On Day 2 there are two panels (which generally I do not count as “speakers” but at least there are a few more women on these):

  • Panel 1: Sally Aitken, Vincent Martin, Elizabeth Edwards, Curtis Suttle, Gerrit Voordouw, Steve Yearley
  • Panel 2: William Davidson, Martine Dubuc, Isobel Parkin, Graham Plastow, Curtis Pozniak, Peter Phillips 

So if you count these, that comes to a presenter ratio of 25:6.  Do I want quotas for meetings?  No, but given that the ratio of men to women in biology is close to 1:1, this suggests to me some sort of bias.  Where does this bias come from?  I don’t know.  It could be at the level of who gets invited.  It could be at the level of who accepts.  It could be some non-obvious criterion for selecting speakers that leads to a bias towards men.  I don’t know.  But I personally think they could do better.  And I note they could probably also do better on other aspects of speaker diversity, but I am focusing here just on the male-to-female ratio.  Again, I am not suggesting one should have quotas for all meetings, but a 17:2 male-to-female speaker ratio suggests something could use some work.

As a side story I decided to look at some past conferences sponsored by Genome Canada.  I worked my way down the list … see below:

  • 2008 Joint IUFRO-CTIA International conference. Speakers: 8:2 male: female
  • 6th Canadian Plant Genomics Workshop Plenary Speakers 8:2
  • 8th Annual International Conference of the Canadian Proteomics Initiative.  See below.  32:2 male to female.  I have no idea what the ratio is in the field of proteomics, but this is a very big skew: 94% male.
    1. Leigh Anderson (Plasma Proteome Institute)
    2. Ron Beavis (UBC)
    3. John Bergeron (McGill)
    4. Christoph Borchers (UVic)
    5. Jens Coorssen (U Calgary)
    6. Al Edwards (U Toronto)
    7. Andrew Emili (U Toronto)
    8. Leonard Foster (UBC)
    9. Jack Greenblatt (U Toronto)
    10. Juergen Kast (UBC)
    11. Gilles Lajoie (U Western Ontario)
    12. Liang Li (U Alberta)
    13. John Marshall (Ryerson)
    14. Susan Murch (UBC Okanagan)
    15. Richard Oleschuk (Queens)
    16. Dev Pinto (NRC)
    17. Guy Poirier (Laval)
    18. Don Riddle (UBC)
    19. David Schreimer (University of Calgary)
    20. Christoph Sensen (University of Calgary)
    21. Michael Siu (York)
    22. John Wilkins (University of Manitoba)
    23. David Wishart (University of Alberta)
    24. Robert McMaster (University of British Columbia)
    25. Peter Liu (University of Toronto)
    26. Christopher Overall (University of British Columbia)
    27. John Kelly (NRC, Ottawa)
    28. Joshua N. Adkins (Pacific Northwest National Laboratory, USA)
    29. Dustin N.D. Lippert (University of British Columbia)
    30. David Juncker (McGill University)
    31. Jenya Petrotchenko (University of Victoria)
    32. Detlev Suckau (Bruker Daltonik GmbH)
    33. Peipei Ping (University of California)
    34. Robert McMaster (University of British Columbia)
I couldn’t bear to go on any further.
Now – note – I am not accusing anyone of bias here.  But I do think it might be a good idea for Genome Canada to put some more effort into figuring out why the conferences they sponsor have such skewed ratios.  And perhaps they can try to do something about this.  For more on this issue from my blog see

Silly microbiologist, genomes are for mutualists

New paper in PLoS Genetics of possible interest: PLoS Genetics: Population Genomics of the Facultatively Mutualistic Bacteria Sinorhizobium meliloti and S. medicae.

The abstract does an OK job with the technical details:

Abstract:
The symbiosis between rhizobial bacteria and legume plants has served as a model for investigating the genetics of nitrogen fixation and the evolution of facultative mutualism. We used deep sequence coverage (>100×) to characterize genomic diversity at the nucleotide level among 12 Sinorhizobium medicae and 32 S. meliloti strains. Although these species are closely related and share host plants, based on the ratio of shared polymorphisms to fixed differences we found that horizontal gene transfer (HGT) between these species was confined almost exclusively to plasmid genes. Three multi-genic regions that show the strongest evidence of HGT harbor genes directly involved in establishing or maintaining the mutualism with host plants. In both species, nucleotide diversity is 1.5–2.5 times greater on the plasmids than chromosomes. Interestingly, nucleotide diversity in S. meliloti but not S. medicae is highly structured along the chromosome – with mean diversity (θπ) on one half of the chromosome five times greater than mean diversity on the other half. Based on the ratio of plasmid to chromosome diversity, this appears to be due to severely reduced diversity on the chromosome half with less diversity, which is consistent with extensive hitchhiking along with a selective sweep. Frequency-spectrum based tests identified 82 genes with a signature of adaptive evolution in one species or another but none of the genes were identified in both species. Based upon available functional information, several genes identified as targets of selection are likely to alter the symbiosis with the host plant, making them attractive targets for further functional characterization.

I think the author summary is a bit more, well, friendly:

Facultative mutualisms are relationships between two species that can live independently, but derive benefits when living together with their mutualistic partners. The facultative mutualism between rhizobial bacteria and legume plants contributes approximately half of all biologically fixed nitrogen, an essential plant nutrient, and is an important source of nitrogen to both natural and agricultural ecosystems. We resequenced the genomes of 44 strains of two closely related species of the genus Sinorhizobium that form facultative mutualisms with the model legume Medicago truncatula. These data provide one of the most complete examinations of genomic diversity segregating within microbial species that are not causative agents of human illness. Our analyses reveal that horizontal gene transfer, a common source of new genes in microbial species, disproportionately affects genes with direct roles in the rhizobia-plant symbiosis. Analyses of nucleotide diversity segregating within each species suggests that strong selection, along with genetic hitchhiking has sharply reduced diversity along an entire chromosome half in S. meliloti. Despite the two species’ ecological similarity, we did not find evidence for selection acting on the same genetic targets. In addition to providing insight into the evolutionary history of rhizobia, this study shows the feasibility and potential power of applying population genomic analyses to microbial species.

I have highlighted the section dissing pathogen studies …

As with every good paper, it starts with a tree

Figure 1. Neighbor-joining trees showing relationships among 32 S. meliloti (blue squares) and 12 S. medicae (red circles). 
A) chromosomes, B) pSymA and pSMED02, and C) pSymB and pSMED01. Trees were constructed using sequences from coding regions only. The length of the branch separating S. medicae from S. meliloti strains is shown at a scale that is 5% of the true scale. The 24-strain S. meliloti group is marked by asterisks. All branches had 100% bootstrap support unless otherwise indicated. Branches with <80% bootstrap support were collapsed into polytomies. An identical tree with strain identifications is provided as Figure S2.

The tree lays out the phylogeny of the strains sequenced in this study.  And it provides the main framework for much of the rest of the paper.  
Some comments:
  • The genomes were sequenced to ~100× coverage on an Illumina GAIIx.
  • Reads were then aligned to reference genomes of close relatives of the sequenced strains.
  • These alignments were then used for various comparative and population genetic analyses
  • As far as I can tell no de novo assemblies were done
  • I am quite confused about their methods for detecting putative regions that have undergone horizontal gene transfer:
    • In the methods: “We identified genes likely to have experienced recent horizontal gene transfer by comparing the ratio of polymorphisms that were shared between species to fixed differences between species. Based on the whole-genome distribution of this ratio (Figure S3) we identified putatively transferred genes as those with a ratio of shared polymorphisms to fixed differences >0.2.”
    • Not sure how/why this should work.  Not saying it is a bad idea – I just don’t really understand it.  (A rough sketch of how one might compute such a ratio follows this list.)
  • They also examine various population genetic parameters including possible selection, SNPs, Tajima’s D, and more. 
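
For what it is worth, here is one way to read that shared-polymorphism-to-fixed-difference criterion in code, assuming you already have a per-gene alignment covering strains of both species. This is just my sketch of how such a ratio might be computed, not the authors' pipeline.

```python
def shared_to_fixed_ratio(seqs_sp1, seqs_sp2):
    """seqs_sp1, seqs_sp2: aligned sequences (equal-length strings) for one gene,
    one per strain of each species.  Counts sites that are polymorphic in both
    species with the same alleles segregating (shared polymorphisms) and sites
    where each species is fixed for a different base (fixed differences)."""
    shared_poly = 0
    fixed_diff = 0
    for col1, col2 in zip(zip(*seqs_sp1), zip(*seqs_sp2)):
        a1, a2 = set(col1), set(col2)
        if len(a1) > 1 and len(a2) > 1 and len(a1 & a2) > 1:
            shared_poly += 1   # same polymorphism segregating in both species
        elif len(a1) == 1 and len(a2) == 1 and a1 != a2:
            fixed_diff += 1    # each species monomorphic, but for different bases
    return shared_poly / fixed_diff if fixed_diff else float("inf")

# Genes with a ratio > 0.2 (given the whole-genome distribution of this ratio)
# were flagged as putatively transferred in the paper.
```
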
It is worth a read.  They summarize their various findings with:

Population genetic analyses of nucleotide diversity segregating within Sinorhizobium medicae and S. meliloti have provided unprecedented insight into the evolutionary history of these ecologically important facultative symbionts. While previous analyses have detected evidence for horizontal gene transfer between these species, our data reveal that gene transfer is restricted almost exclusively to plasmid genes and that the plasmid regions that show evidence of transfer have less interspecific divergence than other genomic regions. Interestingly, nucleotide variation segregating within a 24-strain subpopulation of S. meliloti is highly structured along the chromosome, with one half of the chromosome harboring approximately one-fifth as much diversity as the other. The causes of the difference between the two chromosome halves may be a selective sweep coupled with extensive hitchhiking, if this is correct it would suggest that bouts of strong selection may be important in driving the divergence of bacterial species. Finally, we’ve identified genes that bear a signature of having evolved in response to recent positive selection. Functional characterization of these genes will provide insight into the selective forces that drive rhizobial adaptation.

Is Illumina the "duct tape" of sequencing?

Photo from Wikipedia.  Photo by Evan-Amos

For the last year or so I have become a big fan of Illumina sequencing.  We are using it for everything in the lab.  And many others are using it quite a lot too.  All sorts of interesting applications.  But of course, there are other sequencing systems that each have some advantages relative to Illumina.  And one of the key limitations of Illumina sequencing has been read length (though that limitation matters less and less as read lengths from Illumina machines get longer and longer).

The UC Davis Genome Center has had Illumina sequencing systems for many years now and we use them extensively.  However, we felt for some time that we and others around town could benefit from complementary methods, especially those that could generate longer reads.  So we sought funding to buy other systems.  And fortunately we got an NSF MRI grant to do just that, which we used to buy a Roche 454 Jr machine and contribute to the purchase of a Pacific Biosciences machine.  These are good to have around because they open up new windows into sequencing – not just long reads but other areas as well.  For example, the PacBio system can also be used to detect base modifications such as methylation.

Alas, both the 454 and PacBio systems have higher error rates than the Illumina systems.  And this makes some analyses challenging and limits the benefits that come from the longer reads.  So what to do?  For a while people have been using Illumina sequencing to “correct” the errors made by 454 and PacBio sequencing.  And today Matt Herper at Forbes (For A New DNA Sequencer, A Technical Fix May Have Come Too Late – Forbes) discusses a further improvement in the ability to do this error correction (a paper just came out on the topic from Adam Phillippy, Sergey Koren, Michael Schatz, and others).
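
As I understand it, these hybrid correctors work by aligning the accurate short reads to the noisy long reads and calling a consensus along each long read. The toy sketch below, which assumes that alignment step has already been done, shows the majority-vote idea in its simplest possible form; real pipelines are vastly more sophisticated, and the function and example here are mine, not anything from the paper.

```python
from collections import Counter

def correct_long_read(long_read, aligned_bases):
    """Toy hybrid correction: for each position of a noisy long read, take the
    majority base among the high-accuracy short-read bases covering it.
    aligned_bases[i] is the list of short-read bases aligned to position i."""
    corrected = []
    for i, original in enumerate(long_read):
        cover = aligned_bases[i] if i < len(aligned_bases) else []
        if cover:
            corrected.append(Counter(cover).most_common(1)[0][0])
        else:
            corrected.append(original)  # no short-read coverage: keep the base as-is
    return "".join(corrected)

# The 'G' at the seventh position is outvoted by the short reads and corrected to 'C':
print(correct_long_read(
    "ACGTTAGA",
    [["A"], ["C", "C"], ["G"], ["T"], ["T"], ["A", "A"], ["C", "C", "C"], ["A"]]))
# -> ACGTTACA
```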

I find this whole concept a bit funny / interesting.  Not only does Illumina sequencing have many uses, but one of its uses in essence helps keep aloft the potential of some of its competitors.  In this way, Illumina can be considered the duct tape of sequencing systems.  1001 uses.  Not sure the Illumina folks will be overly thrilled with this use but that is the way it goes …

(As an aside – any high-throughput, highly accurate sequencing method could be used in the same way as Illumina in most cases – ABI SOLiD, for example.  But alas for ABI, Illumina has kind of taken over this part of the market.)

(As another aside – we will have to wait and see how/if the Ion Torrent systems take off in the sequencing ecosystem.)

(As another aside – still waiting to see some more detail from the Oxford Nanopores folks … I would be happy to be a beta tester if anyone from Oxford is reading this).

Some notes from GSC13 session on microbiology of the built environment #microBEnet

At the GSC13 meeting a few months ago there was a session on microbiology of the built environment which was sponsored by my microBEnet project.

Posting some details from the meeting here.

Meeting notes and reports

Talk videos:

Paula Olsiewski


The Indoor Standards – What Parameters Do We Need to Record? Jeffrey Siegel (University of Texas at Austin, USA)


Minimal Metadata for the Built Environment: A MIxS Extension Lynn Schriml (University of Maryland, USA)


The Home Microbiome Project: Unraveling the Relationship Between Human-associated and Home-associated Microbial Signatures. Jack Gilbert (University of Chicago/Argonne National Laboratory, USA)


The Indoor Virome! Scott Kelley (San Diego State University, USA)


The Role of VAMPs in the MoBEDAC Initiative Mitchell Sogin (Marine Biological Laboratory, Woods Hole, USA)


MoBEDAC – Handling Fungal Data From MicroBE Jason Stajich (University of California, Riverside, USA)


MoBeDAC – Integrated data and analysis for the indoor and built environment Folker Meyer (Argonne National Laboratory, USA)

Overselling genomics award #7: Ron Davis & Forbes for PR presented as "essay"

Wow.  Just saw this tweet by Dan Vorhaus:

So I decided to check it out. The piece is titled It’s Time to Bet on Genomics and it is, well, just completely inappropriate.  Sure – it does take on an article that was itself over the top in downplaying the power of genomics (see Erika Check Hayden’s article about that issue here).  But then Davis goes on to write about a company founded by an ex-postdoc of his, for which Davis is one of the advisors (he does kindly let us know this, but still …).  And what he writes is a big, big pile of fluff with no evidence presented.  Among the lines in the “essay” I find disturbing:

  • One of the most interesting of these is being developed by Genophen
  • Genophen’s application is rather breathtaking in its ambition.
  • Genophen’s “risk engine”—a simple term for some very complex data mining and computer modeling—will map your risk factors against the world’s vast library of medical research and then offer up a personalized set of behavior and treatment recommendations that can help you reduce those risks . . .  and even prevent disease itself.
  • We are now at the point where genomics-enabled medical technology can run various what-if scenarios and show you whether diet, exercise, medication, or some other factor or combination of factors has the greatest statistical likelihood of reducing that risk. The information can then be visually displayed through charts and graphs and made available to patients and their doctors via secure web-based portals.
  • But instinctively I believe it to be true, and anecdotally Genophen’s first trial provided some confirmation.
  • “The trial changed my life,” one female participant who wishes to remain anonymous told me.

All of this without any link to a paper, without any data, without any real details.  Shameful.  Not saying genomic medicine does not have a lot of promise.  But this “essay” is so excessively focused on PR for one company that there is no reason to have any faith in anything said in it.  I am therefore giving Ron Davis and Forbes my coveted Overselling Genomics Award (#7).  Plus I think Forbes deserves some sort of award for “Publishing PR,” but I will have to think one of those up.  This piece almost certainly should never have been published at Forbes.com without many, many more caveats.  Yuck.

UPDATE – here is a screenshot from the Forbes Web site.  It is marked as “Forbes Leadership Forum” … hard to tell whether it is meant as an essay, editorial, op-ed, or what.

Guest post: Story Behind the Paper by Joshua Weitz on Neutral Theory of Genome Evolution

I am very pleased to have another in my “Story behind the paper” series of guest posts.  This one is from my friend and colleague Josh Weitz from Georgia Tech regarding a recent paper of his in BMC Genomics.  As I have said before – if you have published an open access paper on a topic related to this blog and want to do a similar type of guest post let me know …

—————————————-
A guest blog by Joshua Weitz, School of Biology and Physics, Georgia Institute of Technology

Summary
This is a short, well, sort-of-short, story of the making of our paper: “A neutral theory of genome evolution and the frequency distribution of genes,” recently published in BMC Genomics. I like the story-behind-the-paper concept because it helps to shed light on what really happens as papers move from ideas to completion. It’s something we talk about in group meetings, but it’s nice to contribute an entry in this type of forum.  I am also reminded in writing this blog entry just how long science can take, even when, at least in this case, it was relatively fast.


The pre-history
The story behind this paper began when my former PhD student, Andrey Kislyuk (who is now a Software Engineer at DNAnexus), approached me in October 2009 with a paper by Herve Tettelin and colleagues.  He had read the paper in a class organized by Nicholas Bergman (now at NBACC).  The Tettelin paper is a classic, and deservedly so.  It unified discussions of gene variation between genomes of highly similar isolates by estimating the total size of the pan and core genome within multiple sequenced isolates of the pathogen Streptococcus agalactiae.

However, there was one issue that we felt could be improved: how does one extrapolate the number of genes in a population (the pan genome) and the number of genes that are found in all individuals in the population (the core genome) based on sample data alone?  Species definitions notwithstanding, Andrey felt that estimates depended on details of the alignment process utilized to define when two genes were grouped together.  Hence, he wanted to evaluate the sensitivity of core and pan genome predictions to changes in alignment rules.  However, it became clear that something deeper was at stake.  We teamed up with Bart Haegeman, who was on an extended visit in my group from his INRIA group in Montpellier, to evaluate whether it was even possible to quantitatively predict pan and core genome sizes.  We concluded that pan and core genome size estimates were far more problematic than had been acknowledged.  In fact, we concluded that they depended sensitively on estimating the number of rare genes and rare genomes, respectively.  The basic idea can be encapsulated in this figure:

The top panels show gene frequency distributions for two synthetically generated species.  Species A has a substantially smaller pan genome and a substantially larger core genome than does Species B.  However, when one synthetically generates a sample set of dozens, even hundreds, of genomes, the rare genes and genomes that correspond to differences in pan and core genome size do not end up changing the sample rarefaction curves (seen at the bottom, where the green and blue symbols overlap).  Hence, extrapolation to the community size will not necessarily be able to accurately estimate the size of the pan and core genome, nor even which is larger!
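
To make the rarefaction idea concrete, here is a minimal sketch (my own naming and structure, not code from the paper) of how pan- and core-genome curves are typically built up from a collection of genomes, each represented as a set of gene-family identifiers, by adding genomes in random order and averaging over orderings.

```python
import random

def pan_core_rarefaction(genomes, n_permutations=100, seed=0):
    """genomes: dict mapping genome name -> set of gene-family IDs.
    Returns the average pan- and core-genome sizes as genomes are added
    one at a time in random order."""
    rng = random.Random(seed)
    names = list(genomes)
    n = len(names)
    pan_sums = [0.0] * n
    core_sums = [0.0] * n
    for _ in range(n_permutations):
        rng.shuffle(names)
        pan, core = set(), None
        for k, name in enumerate(names):
            genes = genomes[name]
            pan |= genes                                   # union so far
            core = set(genes) if core is None else core & genes  # intersection so far
            pan_sums[k] += len(pan)
            core_sums[k] += len(core)
    return ([s / n_permutations for s in pan_sums],
            [s / n_permutations for s in core_sums])

# toy_genomes = {"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}, "g3": {"a", "c", "e"}}
# pan_curve, core_curve = pan_core_rarefaction(toy_genomes)
```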

As an alternative, we proposed a metric we termed “genomic fluidity” which captures the dissimilarity of genomes when comparing their gene composition.

The quantitative value of genomic fluidity of the population can be estimated robustly from the sample.  Moreover, even if the quantitative value depends on gene alignment parameters, its relative order is robust.  All of this work is described in our paper in BMC Genomics from 2011: Genomic fluidity: an integrative view of gene diversity within microbial populations.
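
The metric itself is simple enough to compute in a few lines. The sketch below follows my reading of the definition in the 2011 paper (average, over all pairs of genomes, of the fraction of the pair's genes that are found in only one member of the pair); it is an illustration, not the authors' code.

```python
from itertools import combinations

def genomic_fluidity(genomes):
    """genomes: dict mapping genome name -> set of gene-family IDs.
    Returns the mean, over all genome pairs, of
    (genes unique to either genome) / (total genes in the pair)."""
    ratios = []
    for ga, gb in combinations(genomes.values(), 2):
        unique = len(ga ^ gb)          # genes present in only one of the two genomes
        total = len(ga) + len(gb)
        ratios.append(unique / total)
    return sum(ratios) / len(ratios)

# Toy example: two genomes sharing 2 of their 3 genes -> fluidity = 2/6 ≈ 0.33
print(genomic_fluidity({"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}}))
```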

However, as we were midway through our genomic fluidity paper, it occurred to us that there was one key element of this story that merited further investigation.  We had termed our metric “genomic fluidity” because it provided information on the degree to which genomes were “fluid“, i.e., comprised of different sets of genes.  The notion of fluidity also implies a dynamic, i.e., a mechanism by which genes move. Hence, I came up with a very minimal proposal for a model that could explain differences in genomic fluidity.  As it turns out, it can explain a lot more.

A null model: getting the basic concepts together
In Spring 2010, I began to explore a minimal, population-genetics style model which incorporated a key feature of genomic assays, that the gene composition of genomes differs substantially, even between taxonomically similar isolates. Hence, I thought it would be worthwhile to  analyze a model in which the total number of individuals in the population was fixed at N, and each individual had exactly M genes.  Bart and I started analyzing this together. My initial proposal was a very simple model that included three components: reproduction, mutation and gene transfer. In a reproduction step, a random individual would be selected, removed and then replaced with one of the remaining N-1 individuals.  Hence, this is exactly analogous to a Moran step in a standard neutral model.  At the time, what we termed mutation was actually representative of an uptake event, in which a random genome was selected, one of its genes was removed, and then replaced with a new gene, not found in any other of the genomes.  Finally, we considered a gene transfer step in which two genomes would be selected at random, and one gene from a given genome would be copied over to the second genome, removing one of the previous genes.  The model, with only birth-death (on left) and mutation (on right), which is what we eventually focused on for this first paper, can be depicted as follows:

We proceeded based on our physics and theoretical ecology backgrounds, by writing down master equations for genomic fluidity as a function of all three events.  It is apparent that reproduction decreases genomic fluidity on average, because after a reproduction event, two genomes have exactly the same set of genes.  Likewise, gene transfer (in the original formulation) also decreases genomic fluidity on average, but the decrease is smaller by a factor of 1/M, because only one gene is transferred.  Finally, mutation increases genomic fluidity on average, because a mutation event occurring at a gene that previously occurred in more than one genome introduces a new singleton gene into the population, hence increasing dissimilarity.  The model was simple, based on physical principles, and analytically tractable, at least for average quantities like genomic fluidity; moreover, it had the right tension.  It considered a mechanism for fluidity to increase and two mechanisms for fluidity to decrease.  Hence, we thought this might provide a basis for thinking about how relative rates of birth-death, transfer, and uptake might be identified from fluidity.  As it turns out, many combinations of such parameters lead to the same value of fluidity.  This is common in models, and is often referred to as an identifiability problem.  However, the model could predict other things, which made it much more interesting.
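
A bare-bones simulation of the two-event version of the model (birth-death plus "mutation", i.e., uptake of a gene new to the population) is easy to write down. The sketch below is mine, with illustrative parameter values rather than anything from the paper; it returns the resulting gene frequency distribution across the N genomes.

```python
import random
from collections import Counter

def simulate_neutral_pangenome(N=30, M=100, steps=200_000, mu=0.1, seed=1):
    """N genomes, each carrying exactly M genes.  Each step is either an uptake
    ('mutation') event with probability mu, in which one genome swaps a random
    gene for a brand-new gene, or a Moran birth-death step, in which one genome
    is replaced by a copy of another.  Returns a Counter mapping k -> number of
    genes found in exactly k of the N genomes."""
    rng = random.Random(seed)
    genomes = [set(range(M)) for _ in range(N)]    # start with identical genomes
    next_gene = M
    for _ in range(steps):
        if rng.random() < mu:
            g = rng.randrange(N)
            genomes[g].remove(rng.choice(tuple(genomes[g])))
            genomes[g].add(next_gene)              # a gene never seen before
            next_gene += 1
        else:
            dead, parent = rng.sample(range(N), 2)
            genomes[dead] = set(genomes[parent])   # birth-death replacement
    occupancy = Counter(gene for g in genomes for gene in g)
    return Counter(occupancy.values())

# freqs = simulate_neutral_pangenome()
# print(sorted(freqs.items()))   # how many genes occur in exactly k genomes
```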


The making of the paper
The key moment when the basic model, described above, began to take shape as a paper occurred when we began to think about all the data that we were not including in our initial genomic fluidity analysis.  Most prominently, we were not considering the frequency at which genes occurred amongst different genomes.  In fact, gene frequency distributions had already attracted attention.  A gene frequency distribution summarizes the number of genes that appear in exactly k genomes. The frequency with which a gene appears is generally thought to imply something about its function, e.g., “Comprising the pan-genome are the core complement of genes common to all members of a species and a dispensable or accessory genome that is present in at least one but not all members of a species.” (Laing et al., BMC Bioinformatics 2011).  The emphasis is mine. But does one need to invoke selection, either implicitly or explicitly, to explain differences in gene frequency?

As it turns out, gene frequency distributions end up having a U-shape, such that many genes appear in one or a few genomes, many appear in all (or nearly all) genomes, and relatively few occur at intermediate frequencies.  We had extracted such gene frequency distributions from our limited dataset of ~100 genomes across 6 species.  Here is what they look like:

And, when we began to think more about our model, we realized that the tension that led to different values of genomic fluidity also generated the right sort of tension corresponding to U-shaped gene frequency distributions.  On the one hand, mutations (e.g., uptake of new genes from the environment) would contribute to shifting the distribution to the left-hand side of the U-shape.  On the other hand, birth-death would contribute to shifting the distribution to the right-hand side of the U-shape.  Gene transfer between genomes would also shift the distribution to the right.  Hence, it seemed that for a given set of rates, it might be possible to generate reasonable fits to empirical data that would produce a U-shape.  If so, that would mean that the U-shape was not nearly as informative as had been thought.  In fact, the U-shape could be anticipated from a neutral model in which one need not invoke selection.  This is an important point, as it came back to haunt us in our first round of review.

So, let me be clear: I do think that genes matter to the fitness of an organism and that if you delete/replace certain genes you will find this can have mild to severe to lethal costs (and occasional benefits).  However, our point in developing this model was to try and create a baseline null model, in the spirit of neutral theories of population genetics, that would be able to reproduce as much of the data with as few parameters as possible.  Doing so would then help identify what features of gene compositional variation could be used as a means to identify the signatures of adaptation and selection.  Perhaps this point does not even need to be stated, but obviously not everyone sees it the same way.  In fact, Eugene Koonin has made a similar argument in his nice paper, Are there laws of adaptive evolution: “the null hypothesis is that any observed pattern is first assumed to be the result of non-selective, stochastic processes, and only once this assumption is falsified, should one start to explore adaptive scenarios”.  I really like this quote, even if I don’t always follow this rule (perhaps I should). It’s just so tempting to explore adaptive scenarios first, but it doesn’t make it right.

At that point, we began extending the model in a few directions.  The major innovation was to formally map our model onto the infinitely many alleles model of population genetics, so that we could formally solve our model using the methods of coalescent theory for both cases of finite population sizes and for exponentially growing population sizes.  Bart led the charge on the analytics and here’s an example of the fits from the exponentially growing model (the x-axis is the number of genomes):

At that point, we had a model, solutions, fits to data, and a message.  We solicited a number of pre-reviews from colleagues who helped us improve the presentation (thank you for your help!).  So, we tried to publish it.    


Trying to publish the paper
We tried to publish this paper in two outlets before finding its home in BMC Genomics.  First, we submitted the article to PNAS using their new PNAS Plus format.  We submitted the paper in June 2011 and were rejected with an invitation to resubmit in July 2011.  One reviewer liked the paper, apparently a lot: “I very much like the assumption of neutrality, and I think this provocative idea deserves publication.”  The same reviewer gave a number of useful and critical suggestions for improving the manuscript.  Another reviewer had a very strong negative reaction to the paper.  Here was the central concern: “I feel that the authors’ conclusion that the processes shaping gene content in bacteria are primarily neutral is significantly false, and potentially confusing to readers who do not appreciate the lack of a good fit between predictions and data, and who do not realise that the U-shaped distributions observed would be expected under models where it is selection that determines gene number.”  There was no disagreement over the method or the analysis.  The disagreement was over what our message was.

I still am not sure how this confusion arose, because throughout our first submission and our final published version, we were clear that the point of the manuscript was to show that the U-shape of gene frequency distributions provide less information than might have been thought/expected about selection.  They are relatively easy to fit with a suite of null models.  Again, Koonin’s quote is very apt here, but at some basic level, we had an impasse over a philosophy of the type of science we were doing. Moreover, although it is clear that non-neutral processes are important, I would argue that it is also incorrect to presume that all genes are non-neutral.  There’s lots of evidence that many transferred genes have little to no effect on fitness. We revised the paper, including and solving alternative models with fixed and flexible core genomes, again showing that U-shapes are rather generic in this class of models.  We argued our point, but the editor sided with the negative review, rejecting our paper in November after resubmission in September, with the same split amongst the reviewers. 

Hence, we resubmitted the paper to Genome Biology, which rejected it at the editorial level after a delay of a few weeks without much of an explanation, and at that point we decided to return to BMC Genomics, which we felt had been a good home for our first paper in this area and would likely make a good home for the follow-up.  A colleague once said that there should be an r-index, where r is the number of rejections a paper received before ultimate acceptance.  He argued that an r-index of 0 was likely not good (something about if you don’t fall, then you’re not trying) and an r-index of 10 was probably not good either.  I wonder what’s right or wrong.  But I’ll take an r of 2 in this case, especially because I felt that the PNAS review process really helped to make the paper better even if it was ultimately rejected.  And, by submitting to Genome Biology, we were able to move quickly to another journal in the same BMC consortium.

Upcoming plans
Bart Haegeman and I continue to work on this problem, from both the theory and bioinformatics sides.  I find this problem incredibly fulfilling.  It turns out that there are many features of the model that we still have not fully investigated.  In addition, calculating gene frequency distributions involves a number of algorithmic challenges in scaling up to large datasets.  We are building a platform to help, to some extent, but are looking for collaborators who have specific algorithmic interests in these types of problems.  We are also in discussions with biologists who want to utilize these types of analyses to solve particular problems, e.g., how can the analysis of gene frequency distributions be made more informative with respect to understanding the role of genes in evolution and the importance of genes to fitness.  I realize there are more such models out there tackling other problems in quantitative population genomics (we cite many of them in our BMC Genomics paper), including some in the same area of understanding the core/pan genome and gene frequency distributions.  I look forward to learning from and contributing to these studies.