Guest post by Josh Weitz: IP vs. PI-s: On Intellectual Property and Intellectual Exchange in the Sciences and Engineering as Practiced in Academia

Today I am pleased to publish this guest post from my friend and colleague Joshua Weitz. He does some fantastically interesting research but that is not what he is writing about here. Instead he is focusing on intellectual property and sabbaticals …

Joshua S. Weitz
School of Biology and School of Physics
Georgia Institute of Technology


This blog entry is inspired by a recent personal experience. However, I believe it serves to illustrate a larger issue that requires open discussion amongst and between faculty, administrators, and the tech transfer offices of research universities.


I am temporarily based at the U of Arizona (hereafter UA) working in Matthew Sullivan’s group where I am on a 9 month leave from my home institution, Georgia Tech (hereafter GT), spanning the term: August 15, 2013 to May 15, 2014. I arrived on campus and began the process of signing various electronic forms to allow me to use the campus network, get keys, etc.

The gateway to this process was a request to sign a “Designated Campus Colleague (DCC)” agreement: The DCC agreement is the official way the UA establishes an “association” with long-term visitors who do not receive any form of payment from UA (as applies in my case). I would be an Associate of the Department of Ecology and Evolutionary Biology, where the Sullivan group is based. The privileges of the DCC include receiving a UA email, keys to an office, a “CatCard” (Cat = short for Wildcats, the UA mascot) which enables me to access the building after hours, as well as other discounts. The DCC makes it quite clear that I am not an employee and that I will receive no form of payment whatsoever from UA. Indeed, all of this is rather immaterial to my real reason for spending 9 months here – to work in closer collaboration with the Sullivan group on problems of mutual interest. The obligations I must affirm are largely boilerplate (i.e., do no harm) but do include the following identified in Clause 11 of the DCC:

“11. INTELLECTUAL PROPERTY – Associate hereby assigns to the ABOR all his or her right, title and interest to intellectual property created or invented by Associate in which the ABOR claims an ownership interest under its Intellectual Property Policy (the “ABOR IP Policy”). Associate agrees to promptly disclose such intellectual property as required by the ABOR IP Policy, and to sign all documents and do all things necessary and proper to effect this assignment of rights. Associate has not agreed (and will not agree) in consulting or other agreements to grant intellectual property rights to any other person or entity that would conflict with this assignment or with the ABORs’ ownership interests under the ABOR IP Policy”

This clause is worth reading twice. I did. I then refused to sign the agreement. This clause, my refusal to sign the agreement, and what happened next are the basis for this blog entry.

Note that the template upon which my particular agreement is based can be found on the UA website. For those unfamiliar with intellectual property (IP) clauses that universities routinely request visitors to sign, here is a small (random) sample:

  1. U of Texas
  2. U of Maryland Baltimore
  3. Emory University
  4. Washington University @ St. Louis
  5. Northwestern University

It would be possible to collect many more such agreements that apply to both employees and to visitors of research universities.

The issues
The central issue at stake is ownership of future intellectual property. Presumably a visitor has decided to visit U of X because U of X is the best in the world, and new, monetizable discoveries will emerge as a direct result of the “resources” at U of X. This may be true. Or, it may not be. However, rather than try to chase IP after it has been created (a tenuous legal position), universities would rather have all visitors assign future IP immediately upon arrival and sort it out later (if and when IP is generated).

This language of “hereby assigns” (as quoted above) is not chance legal-ese. To the contrary, it is clear that the legal staff and upper administration at universities are concerned with respect to the repercussions of Stanford vs. Roche. If you haven’t read up on this case, I encourage you to do so. The key take-away message is that the Supreme Court has ruled that inventions remain the property of the inventor to assign as they see fit unless the inventor assigns inventions to a specific entity (e.g., an employer). As the U of California memo to faculty makes clear, the way that many universities want to deal with this problem moving forward is to reword their employment contracts so that employees immediately assign their IP to their university, rather than “agree to assign” their IP. “Agree to assign” (a clause meant to describe a potential future action) was the language found in Stanford’s prior IP agreement and represented one key component of Stanford losing their Supreme Court case to Roche, to the disappointment of Stanford, other major research universities and the federal government.

Figure 1 – Supreme Court decision on Stanford vs. Roche, remainder available here:

However, scientific visits to other institutions create a complex contradiction whereby institutions ask their visitors to sign agreements that they would never want their own employees to sign while visiting other institutions!

These agreements make perfect sense if only one university retains aggressive IP language (i.e., they grab all the IP of visitors who spend time on their campus and give none away when their own employees become visitors elsewhere). Of course, none of this makes sense once all (or many) institutions adopt such policies, as is now the case. For example, how does someone from the U of California system visit UA? Or, as I asked myself, how can I “hereby assign” IP rights to the Board of Regents of Arizona when I have already assigned them to the Board of Regents of Georgia? Moreover, how can I affirm that I have “not agreed (and will not agree) in consulting or other agreements to grant intellectual property rights to any other person or entity that would conflict with this assignment”? Obviously, any faculty visitor already has a prior IP agreement with their employer! What should be done? Well, before I make some recommendations, let me point out that not all visits are the same. Instead of trying for a single blanket IP policy (the approach taken by most companies and apparently co-opted by academia), consider the following scenarios.

A few scenarios

  • Case 1: The visitor learning techniques. Many experimental groups routinely welcome visitors to learn new techniques and practice established techniques on new equipment. In many cases, such visits have a dual benefit: first, the visitor (often a student) learns a new method that they can apply to their research problem; second, the host helps to spread the use of some technique or method that they may have had a hand in establishing. There is a strong degree of cooperation here, one that is common in many, but not all, branches of science and engineering. The objective of such visits is not to perform a key experiment or test but to learn the basic steps that could enable their own advances at some future point. 
  • Case 2: The visitor performing research experiments. In some instances, collaboration requires visits to a peer institution. Such visitors may stay for short- (e.g., ≤ 1 week) or long-term (> 1 week) periods. The purpose is to generate new scientific data that may, in turn, represent novel IP. The performance of such experiments requires some resource at the host institution; however, it is almost certainly the case that resources (whether personnel or material) are also being contributed by the visitor.
  • Case 3: The short-term collaborative visitor. Scientists and engineers routinely visit each other. Why? Perhaps because they like talking about science with their peers or because they like learning about what is happening at other institutions or because they like talking to (and recruiting) students at other institutions, and in many cases, all of the above! Visits may range from 1-2 days (while giving a seminar) to a week (while spending focused time visiting a group that may kickstart a scientific collaboration). I have used the 1 week threshold rather arbitrarily, but it is helpful to classify visitors as either short- (≤ 1 week) or long-term (> 1 week).
  • Case 4: The long-term collaborative visitor. Similar to Case 3, albeit on a longer scale, typically associated with sabbatical visits. Such visits need not involve any use of University resources in the sense that the visitor does not conduct experiments, does not utilize lab equipment, resources, reagents, etc. The purpose of such visits may be to experience a distinct intellectual environment, to stimulate a long-term collaboration, and/or to embark on a new research direction.

Before specifying recommendations that address these cases, I simply want to point out that these are different cases. Treating all visitors equally with respect to IP both conflicts with the spirit of open scientific exchange and reflects poorly on the extent to which the drafters of such policy appreciate what takes place during scientific visits. Hence, let me pause and ask you, the reader, to consider: what sort of guidelines would you recommend for each of these cases? And, while you’re busy considering that question, let me also propose a unified equation to try and shed some light on factors that suggest that IP policies are both self-contradictory and self-defeating.

An illustrative “equation” of the costs and benefits of aggressive requests for control of the IP of institutional visitors

I propose the following equation to quantify the total amount of money generated by a scientific visitor to a host institution:

$ = P * M * F


P = Probability of the invention taking place, as a direct result of the scientific visit
M = the Monetary value of the invention, accrued over its lifetime
F = Fraction of the dollar value assigned to the host institution.

Indeed, I believe it worthwhile to examine the effect that such policies have on each component, to the extent that university IP policies for visitors are aimed to increase incoming revenues to their institutions. First, how do policies affect P, i.e., the Probability of the invention? Second, do they affect the Monetary value? Finally, how do they affect the Fraction of accrued revenue generated by visitor-initiated IP? Of note: it is clear that when a university asks a visitor to “hereby assign” IP rights, then they are trying to position themselves positively with respect to the 2011 Supreme Court ruling in the Stanford vs. Roche case. In other words, they don’t want to become another Stanford and lose out on potentially significant inventions. However, the key word is potentially.

P: I think it fair to say that such policies are also likely to have negative effects on the P component of this equation. If a visitor knows that a visit to another institution is likely to involve giving up IP, creating conflicts with the IP policy of their own institution, or wasting many hours in trying to resolve such conflicts, are they more or less likely to spend the sort of time and energy necessary to create the IP in the first place? I am quite happy to visit the Sullivan group, but certainly far less happy to be spending time on visitor agreements (although I hope this blog entry may be of service to others). Perhaps others may feel similarly.

M: I don’t know whether such policies have a neutral or negative effect on the Monetary scale of invention (and its future monetization). But, I can’t imagine such IP policies act to increase the monetary scale of invention for which the inventor gives all claim to an entity other than his/her employer without surety of compensation! Moreover, whatever monetary return to the university might ensue, such agreements are also likely to lead to legal costs due to the conflicts that these agreements generate, thereby decreasing the return on the invention.

F: The IP clauses claim to do an effective job of increasing the Fraction of revenue a host institution will acquire. Indeed, they claim to protect 100% of the IP-related revenue. However, given the fact that any such clause is almost certainly in direct contradiction to the employee agreement of their visitor, then it is not so clear which clause of which agreement takes precedence.
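To make the trade-off concrete, here is a minimal sketch of the equation in Python. The input values are hypothetical placeholders, not estimates drawn from any data; they only illustrate how a policy that pushes F to 100% can still lower the product if it also lowers P and M.

```python
def expected_revenue(p, m, f):
    """Return $ = P * M * F, the expected revenue to the host institution.

    p: probability that an invention results from the visit (0 to 1)
    m: monetary value of the invention over its lifetime (dollars)
    f: fraction of that value assigned to the host institution (0 to 1)
    """
    for name, value in (("p", p), ("f", f)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    if m < 0:
        raise ValueError(f"m must be non-negative, got {m}")
    return p * m * f


# Hypothetical inputs, chosen only for illustration.
# A collaborative policy: the host shares the IP but visits are more productive.
collaborative = expected_revenue(p=0.02, m=1_000_000, f=0.5)
# An aggressive policy: the host claims everything, but P and M both drop.
aggressive = expected_revenue(p=0.01, m=800_000, f=1.0)

print(f"collaborative policy: ${collaborative:,.0f}")
print(f"aggressive policy:    ${aggressive:,.0f}")
```

With these made-up inputs the aggressive policy yields the smaller product, but the comparison flips easily if P and M are assumed to be less sensitive to the policy. The point of the sketch is only that all three factors have to be considered together.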

In summary, I hope this formula and discussion have shed some light on the costs and benefits of aggressive IP policies. I contend that aggressive IP policies act to increase one aspect of potential revenue but are likely to decrease two other aspects. Whether the cumulative effect is positive or negative remains a topic for someone else’s blog (or study). In any case, the most likely outcome of visits is not necessarily IP but ideas. These ideas are much more likely to be shared in the public domain and perhaps even be the basis for collaborative research proposals to governmental and non-governmental funding groups. If at the end of the day, the university administration is counting $, irrespective of where it is generated, then it would be far more sensible to generate policies that would not favor one type of revenue stream (licensing derived from IP) over another (direct and indirect costs generated from grants). To reiterate: I suggest that aggressive IP policies are likely to negatively affect the probability that interactions occur that lead to monetizable ideas in the first place. Of course, such IP clauses may also have negative effects on the pursuit of knowledge, generation of knowledge, etc. (but wait, that was never the issue anyway).

I am not a lawyer. As such, the recommendations I lay out should not be mistaken for legal advice. Rather, I hope they are viewed as a few practical guidelines to avoid creating legal impossibilities and, in the process, diminishing rather than augmenting the likelihood that meaningful scientific exchange leads to the type of knowledge that will benefit society at large (and in some cases, stimulate revenues for institutions and individuals for whom this is important). These recommendations are meant to apply to instances where visitors from other US universities or the government do not receive a salary/stipend/benefits and therefore are not employees. Cases where a visitor is paid a stipend/salary/benefits suggest a form of employee-employer relationship that involves distinct contractual obligations. Cases of visitors from industry and/or foreign institutions may require separate treatment.

1. Clauses regarding partial/full assignment/protection of IP should not be considered a default standard in visitor agreements unless (i) the visitor will perform experiments and/or directly utilize laboratory equipment, reagents and materials; (ii) the visitor will discuss, view, or in any way interact with proprietary information/materials owned by the university. Identifying such cases could be addressed via simple online questionnaires when establishing the agreement.

2. In the event that language regarding assignment/protection of IP is necessary, such clauses in visitor agreements should not attempt to take primacy over the IP assignment that the visitor almost certainly has signed with their employer. Rather, they should specify their rights to ownership over their employee’s portion of IP generated as a result of the collaborative visit. At the end of the day, if IP is generated, then the scientists and engineers involved at both institutions will either come to a satisfactory division of percentage stakes or not. Technology transfer offices at major research universities routinely interact with each other and, I imagine, would be receptive to such a collaborative approach that they routinely practice, irrespective of the rules on the books that may have emerged from other administrative offices.
Thankfully, I believe that an excellent model for the default IP clause in visitor agreement already exists! It is part of the Visitor Agreement to the Kavli Institute of Theoretical Physics at UC Santa Barbara. I have highlighted the clauses that I think should become the new default:

“INTELLECTUAL PROPERTY – To the extent legally permissible and subject to any overriding UC obligations to third parties, your home institution may retain ownership of any patentable inventions or copyrightable software you may develop during your work at KITP while participating in a KITP program as long as: (1) you will be visiting KITP for less than one year; (2) you do not need to be entered into UCSB’s official payroll system; (3) any financial support provided to you by KITP is for travel, living expenses and other similar costs and not to support direct research activities/projects (participation in the KITP program is not considered a direct research activity); and, (4) your activities occur in KITP facilities, only. Please note that if you engage in any research-related activities outside of KITP facilities and programs, the UC’s standard intellectual property policies, which require the UC to own intellectual property developed by visitors using UC facilities, will apply”

Figure 2 – KITP, indeed why would you go elsewhere on campus if you were based here!

The benefit of such a clause is that it assumes the default mode of a visit is for scientific collaboration. That is absolutely correct! Further, it does not try to replace the established agreement of visitors with their home institutions. Indeed, other institutions have tried (to some extent) to address this point, e.g., that of Stanford University which explicitly creates an alternative agreement if a prior employer agreement is already in place: SU 18A. A simple questionnaire could be used to help identify cases that require further discussion. The key here is simple, since creating yet another bureaucratic layer of complexity to visits is not what the scientific community wants nor needs. Indeed, a final recommendation should be:

3. IP agreements should not become an element of short-term visits where laboratory access is not needed (i.e., as part of seminars, symposia, mini-conferences, etc.).

Of course, some might argue: all of this is moot since universities should not be in the business of generating, retaining and fighting for IP that is created with tax dollars, but should give away access to all published inventions. That perspective is important and discussed extensively elsewhere. However, the issue at stake is that for some institutions, IP related revenue is a significant portion of income and this is a likely driver of the increasingly aggressive visitor IP policies that are unlikely to disappear.

In closing, I hope that this entry stimulates some discussion and perhaps even productive conversations to minimize the extent to which IP clauses act at cross-purposes with the visits of Principal Investigators, postdocs and students between peer institutions.


All parties, both at UA and GT, have been incredibly helpful and highly sympathetic as I explained my rationale for taking issue with the IP clause of my visitor agreement. Perhaps they were also surprised that I read the agreement. In the interest of expediting the process, I handed over my case to the appropriate representatives at GT and at UA. After some discussion, they found a way forward. Recall that the original IP clause in my visitor agreement includes a reference to claims of ownership under the “ABOR IP Policy”. The relevant ABOR IP policy covers two categories of intellectual property:

  1. “Any intellectual property created by a university or Board employee in the course and scope of employment, and
  2. Any intellectual property created with the significant use of Board or university resources, unless otherwise provided in an authorized agreement for the use of those resources.”

The terms of my visitor agreement make it clear that I am not an employee. Hence, case 1 does not apply. Second, because of my situation as a theoretician, I do not plan on using any UA laboratories, equipment, materials, etc. Moreover, all of my computational-based research will continue to be conducted on GT-owned or personally owned computers. Hence, the only “resources” I plan to utilize are: (i) an office; (ii) the internet. Both UA and GT agree that such activities do not cross the threshold for “significant use” of resources. Hence, case 2 does not apply. So long as my use does not change, then UA should not have standing to claim any IP and my signing of this agreement would not conflict with my employee contract at GT. This understanding now has a paper trail.

As of mid-September 2013 (one month or so after my initial refusal to sign the DCC), I have now signed the agreement and am now officially a Visiting Professor in the Department of Ecology and Evolutionary Biology at the University of Arizona.

Guest post from Joshua Weitz: Talking about the PI Sabbatical Beforehand: A Brief Guide for Faculty, Postdocs, and PhD Students in the Sciences

Guest Post by Joshua Weitz, Associate Professor, School of Biology and School of Physics, Georgia Institute of Technology


I direct a theoretical ecology and quantitative biology group based in the School of Biology at Georgia Tech.  I am going on a 9 month “leave” (Georgia Tech does not call them sabbaticals) to the Department of Ecology and Evolutionary Biology at the U of Arizona in Tucson, AZ from August 2013-May 2014 where I will be based in Matthew Sullivan’s Tucson Marine Phage Laboratory.  In preparation for this leave, our group held an interactive discussion on challenges and opportunities arising from the PI sabbatical for faculty, postdocs and PhD students in the sciences.  The discussion took place in four parts in a one-hour period.  Below I describe the setup of the discussion followed by specific recommendations for faculty, postdocs and PhD students prior to the PI sabbatical. 
How to Talk about the PI Sabbatical
Part 1 – the setup: I asked for a show of hands of group members who had thought about how my sabbatical would change the group and its dynamics. Nearly all members raised their hands. When asked, the group members also noted that they were most concerned about how the sabbatical would affect them. Hence, in an effort to try and understand the effect of the sabbatical on all members, we split into three small discussion groups which were asked to identify challenges and opportunities for (i) the PI; (ii) postdocs; (iii) PhD students.
Part 2 – small group discussion: The individual groups talked about how the sabbatical would affect different group members. There are currently 9 members in the group (not including the PI), so we divided into three groups of three (I did not actively participate in the small group discussions, but did check in on all three groups). The PI group comprised 1 postdoc, 1 graduate student and 1 undergraduate. The postdoc group had 1 postdoc and 2 graduate students. The grad student group had 3 graduate students. Hence, the first item of discussion was an effort to identify issues at stake at each career stage. Then, the groups began to discuss how the sabbatical might change business as usual. The groups spoke for ~15 minutes.
Part 3 – reporting: Challenges and opportunities were identified for each of the three categories. A number of salient themes emerged that serve as general recommendations. The consensus was that a number of common themes would emerge prior to a sabbatical although each science research group may differ in its own interactions. Our presumption is that the PI was going alone and this shaped the nature of our recommendations. First, as suggested by one of the students, there was a sense that the PI sabbatical would lead the students into a “Spiderman situation” in the sense that “with great power comes great responsibility”. The PI sabbatical would lead to greater independence for group members, and this independence involves a greater need for self-motivation, taking a holistic (long-term) view of one’s research, and increased pre-planning given the changes to the PI’s availability. Second, clear communication is essential. For example, if a PI plans to be incommunicado for long stretches of time, this may be manageable (even if non-ideal from a student perspective), so long as provisions have been made to handle both the administrative and research duties that the PI normally would handle. As a rule of thumb, the greater the change in PI availability, the greater the need for pre-planning to ensure that students and postdocs remain on track for research, career and personal development goals.
Part 4 – the view of the PI:  I provided additional feedback, tailored to the group and specifically addressed an issue that could create the most anxiety: my availability for one-on-one interactions.  I also distributed an initial recommendation list, modified here in light of group discussion. 

Five Specific pre-Sabbatical Recommendations for Faculty, Postdocs and PhD Students

Faculty
1.     Develop a plan for your year ahead: what are the key goals for the sabbatical?
2.     Identify what is going to be different and what is going to remain the same: e.g., a new project(s), less/no teaching; less/no administrative duties, a new interaction schedule with the group, etc.
3.     Communicate your plans for the sabbatical and your expectations of group members to the group (ideally, after a group discussion of the kind outlined here).
4.     Talk to your Chair about expectations for your year and new expectations (if any) upon your return (and talk to your departmental admin team to make sure they are aware of your plans).
5.     Establish new interactions with your local host and host community.
Postdocs
1.     Establish a regular schedule of interactions with your adviser.
2.     Keep focused on your research & career goals (i.e., do not become a proxy adviser in the absence of the PI, i.e., see 3 & 4 below)
3.     Determine your supervisory responsibilities – what is your (limited) role to advise the students, technicians in the group?
4.     Determine your lab management responsibilities – what is your (limited) role in ordering and other admin duties?
5.     Travel to collaborators and mentors – do not just stay put while your adviser is away.
PhD students
1.     Identify the major research and career development expectations during your adviser’s time away – how will the adviser’s absence affect your thesis (if at all)?
2.     Establish a regular schedule of interactions with your adviser and senior members of the group.
3.     Contact your adviser, even off-schedule if you really need advice.
4.     Remember: your PI’s sabbatical is an opportunity for independence, increased self-motivated work and development as a scholar, not a “holiday”.
5.     Identify a local faculty member who can serve as an occasional resource to provide input and thoughts on your thesis work (this should be coordinated in advance, with your PI).
Final thoughts:
A PI sabbatical can be a very positive opportunity for all group members to become more independent, to set off on new directions, and to bring greater creativity and productivity to a group.  However, two notes of caution.  First, if you are not yet in a group, think carefully before joining with an absent PI, as the initial period in a group (regardless of your status) often sets the frame for the long-term interaction.  Second, the PI remains the PI, so be wary of a sabbatical plan that involves anyone other than the PI becoming the acting group leader.   Although certain senior members may take over duties, the sabbatical plan should (ideally) involve availability of the PI to make key decisions critical to the group, including thesis advancement, hiring/firing and mediation of major conflicts.

And, I suppose I’ll have to revisit this guide next year to report back on what worked and what we should have thought of in advance!


  •  Dr. Joshua Weitz
  •  Dr. Alexander Bucksch
  •  Dr. Michael Cortez
  •  Abhiram Das, PhD candidate
  •  Cesar Flores, PhD candidate
  •  Luis Jover, PhD candidate
  •  Gabriel Mitchell, PhD candidate
  •  Bradford Taylor, PhD candidate
  •  Charles Wigington, PhD candidate
  •  Victoria Chou, NSF REU student
Further reading
Many blogs are available detailing sabbatical “adventures” and “diaries”.

Guest post: Story Behind the Paper by Joshua Weitz on Neutral Theory of Genome Evolution

I am very pleased to have another in my “Story behind the paper” series of guest posts.  This one is from my friend and colleague Josh Weitz from Georgia Tech regarding a recent paper of his in BMC Genomics.  As I have said before – if you have published an open access paper on a topic related to this blog and want to do a similar type of guest post let me know …

A guest blog by Joshua Weitz, School of Biology and Physics, Georgia Institute of Technology

Summary This is a short, well sort-of-short, story of the making of our paper: “A neutral theory of genome evolution and the frequency distribution of genes” recently published in BMC Genomics. I like the story-behind-the-paper concept because it helps to shed light on what really happens as papers move from ideas to completion. It’s something we talk about in group meetings but it’s nice to contribute an entry in this type of forum.  I am also reminded in writing this blog entry just how long science can take, even when, at least in this case, it was relatively fast. 

The pre-history The story behind this paper began when my former PhD student, Andrey Kislyuk (who is now a Software Engineer at DNAnexus) approached me in October 2009 with a paper by Herve Tettelin and colleagues.  He had read the paper in a class organized by Nicholas Bergman (now at NBACC). The Tettelin paper is a classic, and deservedly so.  It unified discussions of gene variation between genomes of highly similar isolates by estimating the total size of the pan and core genome within multiple sequenced isolates of the pathogen Streptococcus agalactiae.  

However, there was one issue that we felt could be improved: how does one extrapolate the number of genes in a population (the pan genome) and the number of genes that are found in all individuals in the population (the core genome) based on sample data alone? Species definitions notwithstanding, Andrey felt that estimates depended on details of the alignment process utilized to define when two genes were grouped together. Hence, he wanted to evaluate the sensitivity of core and pan genome predictions to changes in alignment rules. However, it became clear that something deeper was at stake. We teamed up with Bart Haegeman, who was on an extended visit in my group from his INRIA group in Montpellier, to evaluate whether it was even possible to quantitatively predict pan and core genome sizes. We concluded that pan and core genome size estimates were far more problematic than had been acknowledged. In fact, we concluded that they depended sensitively on estimating the number of rare genes and rare genomes, respectively. The basic idea can be encapsulated in this figure:

The top panels show gene frequency distributions for two synthetically generated species.  Species A has a substantially smaller pan genome and a substantially larger core genome than does Species B.  However, when one synthetically generates a sample set of dozens, even hundreds of genomes, then the rare genes and genomes that correspond to differences in pan and core genome size, do not end up changing the sample rarefaction curves (seen at the bottom, where the green and blue symbols overlap).  Hence, extrapolation to the community size will not necessarily be able to accurately estimate the size of the pan and core genome, nor even which is larger!
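The sample quantities discussed here are simple to compute, which is part of why extrapolating from them is tempting. Below is a minimal sketch, assuming each genome is represented as a set of gene identifiers (a representation chosen for illustration, not taken from the paper). The sample pan genome is the union of the gene sets, the sample core genome is their intersection, and a rarefaction curve averages both sizes over random subsamples of k genomes. The function names and the `n_draws` parameter are hypothetical.

```python
import random


def pan_and_core(genomes):
    """Return (pan size, core size) for genomes given as sets of gene ids."""
    genomes = [set(g) for g in genomes]
    if not genomes:
        raise ValueError("need at least one genome")
    pan = set().union(*genomes)          # genes found in any genome
    core = set.intersection(*genomes)    # genes found in every genome
    return len(pan), len(core)


def rarefaction(genomes, n_draws=100, seed=0):
    """Mean sample pan and core size for subsamples of k = 1..len(genomes)."""
    rng = random.Random(seed)
    genomes = [set(g) for g in genomes]
    curve = []
    for k in range(1, len(genomes) + 1):
        pan_total, core_total = 0, 0
        for _ in range(n_draws):
            subsample = rng.sample(genomes, k)  # k genomes without replacement
            pan, core = pan_and_core(subsample)
            pan_total += pan
            core_total += core
        curve.append((k, pan_total / n_draws, core_total / n_draws))
    return curve


# Toy data, invented for illustration only.
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
for k, mean_pan, mean_core in rarefaction(toy_genomes, n_draws=50):
    print(k, mean_pan, mean_core)
```

Note that this only describes the sample. Nothing in the curve itself says how many rare genes or rare genomes remain unsampled, which is exactly the difficulty with extrapolating to the full population.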

As an alternative, we proposed a metric we termed “genomic fluidity” which captures the dissimilarity of genomes when comparing their gene composition.

The quantitative value of genomic fluidity of the population can be estimated robustly from the sample.  Moreover, even if the quantitative value depends on gene alignment parameters, its relative order is robust.  All of this work is described in our paper in BMC Genomics from 2011: Genomic fluidity: an integrative view of gene diversity within microbial populations.
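
To make the metric concrete, here is a minimal Python sketch. It assumes that fluidity is the average, over all pairs of genomes in the sample, of the number of genes found in only one genome of the pair divided by the total number of genes in both genomes. The function name and the toy gene labels are illustrative only and are not taken from the code used in the paper.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Average pairwise gene dissimilarity of a sample of genomes.

    Each genome is a set of gene-family labels. For every pair, count the
    genes present in exactly one of the two genomes and divide by the total
    number of genes in the pair; then average over all pairs.
    (Assumed definition; see the lead-in.)
    """
    pairs = list(combinations(genomes, 2))
    if not pairs:
        raise ValueError("need at least two genomes")
    total = 0.0
    for a, b in pairs:
        unique = len(a ^ b)  # symmetric difference: genes in only one genome
        total += unique / (len(a) + len(b))
    return total / len(pairs)


# Toy sample: three genomes with four genes each (labels are made up).
toy_genomes = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6", "g7"},
]
print(genomic_fluidity(toy_genomes))
```

Under this definition, identical genomes give a fluidity of 0 and genomes with no genes in common give a fluidity of 1.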

However, as we were midway through our genomic fluidity paper, it occurred to us that there was one key element of this story that merited further investigation. We had termed our metric “genomic fluidity” because it provided information on the degree to which genomes were “fluid”, i.e., comprised of different sets of genes. The notion of fluidity also implies a dynamic, i.e., a mechanism by which genes move. Hence, I came up with a very minimal proposal for a model that could explain differences in genomic fluidity. As it turns out, it can explain a lot more.

A null model: getting the basic concepts together
In Spring 2010, I began to explore a minimal, population-genetics style model which incorporated a key feature of genomic assays, that the gene composition of genomes differs substantially, even between taxonomically similar isolates. Hence, I thought it would be worthwhile to  analyze a model in which the total number of individuals in the population was fixed at N, and each individual had exactly M genes.  Bart and I started analyzing this together. My initial proposal was a very simple model that included three components: reproduction, mutation and gene transfer. In a reproduction step, a random individual would be selected, removed and then replaced with one of the remaining N-1 individuals.  Hence, this is exactly analogous to a Moran step in a standard neutral model.  At the time, what we termed mutation was actually representative of an uptake event, in which a random genome was selected, one of its genes was removed, and then replaced with a new gene, not found in any other of the genomes.  Finally, we considered a gene transfer step in which two genomes would be selected at random, and one gene from a given genome would be copied over to the second genome, removing one of the previous genes.  The model, with only birth-death (on left) and mutation (on right), which is what we eventually focused on for this first paper, can be depicted as follows:

We proceeded based on our physics and theoretical ecology backgrounds, by writing down master equations for genomic fluidity as a function of all three events. It is apparent that reproduction decreases genomic fluidity on average, because after a reproduction event, two genomes have exactly the same set of genes. Likewise, gene transfer (in the original formulation) also decreases genomic fluidity on average, but the decrease is smaller by a factor of 1/M, because only one gene is transferred. Finally, mutation increases genomic fluidity on average, because a mutation event occurring at a gene that had previously occurred in more than one genome introduces a new singleton gene in the population, hence increasing dissimilarity. The model was simple, based on physical principles, was analytically tractable, at least for average quantities like genomic fluidity, and moreover it had the right tension. It considered a mechanism for fluidity to increase and two mechanisms for fluidity to decrease. Hence, we thought this might provide a basis for thinking about how relative rates of birth-death, transfer and uptake might be identified from fluidity. As it turns out, many combinations of such parameters lead to the same value of fluidity. This is common in models, and is often referred to as an identifiability problem. However, the model could predict other things, which made it much more interesting.
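
For readers who prefer simulation to master equations, the birth-death and mutation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis in the paper: the gene transfer step is omitted, at each step a mutation is chosen with a fixed probability `mutation_prob` (a hypothetical parameter), the population starts from identical genomes, and fluidity is taken to be the average pairwise fraction of unshared genes.

```python
import random
from itertools import combinations, count


def fluidity(population):
    """Average pairwise fraction of genes found in only one genome of a pair."""
    pairs = list(combinations(population, 2))
    total = sum(len(set(a) ^ set(b)) / (len(a) + len(b)) for a, b in pairs)
    return total / len(pairs)


def simulate(n_genomes=10, genes_per_genome=20, mutation_prob=0.5,
             steps=2000, seed=0):
    """Neutral model with N genomes of exactly M genes each.

    Each step is either a mutation (uptake) or a Moran birth-death event.
    All parameter names and default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    new_gene = count()  # generator of gene labels never seen before
    founder = [next(new_gene) for _ in range(genes_per_genome)]
    population = [list(founder) for _ in range(n_genomes)]
    for _ in range(steps):
        if rng.random() < mutation_prob:
            # Mutation (uptake): one gene of a random genome is replaced
            # by a gene not found in any other genome.
            genome = rng.choice(population)
            genome[rng.randrange(genes_per_genome)] = next(new_gene)
        else:
            # Moran step: a random individual is removed and replaced by
            # a copy of one of the remaining N-1 individuals.
            dead, parent = rng.sample(range(n_genomes), 2)
            population[dead] = list(population[parent])
    return population


population = simulate()
print(fluidity(population))
```

Setting `mutation_prob=0` keeps the genomes identical (fluidity stays at 0), while raising it pushes fluidity up, which is the tension described above.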

The making of the paper
The key moment when the basic model, described above, began to take shape as a paper occurred when we began to think about all the data that we were not including in our initial genomic fluidity analysis. Most prominently, we were not considering the frequency at which genes occurred amongst different genomes. In fact, gene frequency distributions had already attracted attention. A gene frequency distribution summarizes the number of genes that appear in exactly k genomes. The frequency with which a gene appears is generally thought to imply something about its function, e.g., “Comprising the pan-genome are the core complement of genes common to all members of a species and a dispensable or accessory genome that is present in at least one but not all members of a species.” (Laing et al., BMC Bioinformatics 2011). The emphasis is mine. But does one need to invoke selection, either implicitly or explicitly, to explain differences in gene frequency?

As it turns out, gene frequency distributions end up having a U-shape, such that many genes appear in 1 or a few genomes, many in all genomes (or nearly all), and relatively few occur at intermediate levels.  We had extracted such gene frequency distributions from our limited dataset of ~100 genomes over 6 species.  Here is what they look like:

And, when we began to think more about our model, we realized that the tension that led to different values of genomic fluidity also generated the right sort of tension corresponding to U-shaped gene frequency distributions. On the one hand, mutations (e.g., uptake of new genes from the environment) would contribute to shifting the distribution to the left-hand side of the U-shape. On the other hand, birth-death would contribute to shifting the distribution to the right-hand side of the U-shape. Gene transfer between genomes would also shift the distribution to the right. Hence, it seemed that for a given set of rates, it might be possible to generate reasonable fits to empirical data that would generate a U-shape. In doing so, that would mean that the U-shape was not nearly as informative as had been thought. In fact, the U-shape could be anticipated from a neutral model in which one need not invoke selection. This is an important point as it came back to haunt us in our first round of review.
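
For concreteness, a gene frequency distribution can be tabulated from a sample as in the short Python sketch below. It assumes that each genome has already been reduced to a set of gene-family labels (the alignment and grouping step is not shown), and the function name and toy labels are illustrative only.

```python
from collections import Counter


def gene_frequency_distribution(genomes):
    """Return a list whose entry k-1 is the number of genes that appear
    in exactly k of the sampled genomes, for k = 1..len(genomes)."""
    n = len(genomes)
    # Number of genomes in which each gene occurs.
    occurrences = Counter(gene for genome in genomes for gene in set(genome))
    # Number of genes for each occurrence count.
    histogram = Counter(occurrences.values())
    return [histogram.get(k, 0) for k in range(1, n + 1)]


# Toy sample: three genomes with four genes each (labels are made up).
sample = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6", "g7"},
]
print(gene_frequency_distribution(sample))  # [4, 1, 2]
```

In this toy sample, four genes appear in a single genome, one gene in two genomes, and two genes in all three. The last entry is the core genome of the sample and the sum of all entries is its pan genome, so even this tiny example has the many-rare, many-common, few-intermediate pattern.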

So, let me be clear: I do think that genes matter to the fitness of an organism and that if you delete/replace certain genes you will find this can have mild to severe to lethal costs (and occasional benefits).  However, our point in developing this model was to try and create a baseline null model, in the spirit of neutral theories of population genetics, that would be able to reproduce as much of the data with as few parameters as possible.  Doing so would then help identify what features of gene compositional variation could be used as a means to identify the signatures of adaptation and selection.  Perhaps this point does not even need to be stated, but obviously not everyone sees it the same way.  In fact, Eugene Koonin has made a similar argument in his nice paper, Are there laws of adaptive evolution: “the null hypothesis is that any observed pattern is first assumed to be the result of non-selective, stochastic processes, and only once this assumption is falsified, should one start to explore adaptive scenarios”.  I really like this quote, even if I don’t always follow this rule (perhaps I should). It’s just so tempting to explore adaptive scenarios first, but it doesn’t make it right.

At that point, we began extending the model in a few directions.  The major innovation was to formally map our model onto the infinitely many alleles model of population genetics, so that we could formally solve our model using the methods of coalescent theory for both cases of finite population sizes and for exponentially growing population sizes.  Bart led the charge on the analytics and here’s an example of the fits from the exponentially growing model (the x-axis is the number of genomes):

At that point, we had a model, solutions, fits to data, and a message.  We solicited a number of pre-reviews from colleagues who helped us improve the presentation (thank you for your help!).  So, we tried to publish it.    

Trying to publish the paper
We tried to publish this paper in two outlets before finding its home in BMC Genomics.  First, we submitted the article to PNAS using their new PNAS Plus format.  We submitted the paper in June 2011 and were rejected with an invitation to resubmit in July 2011. One reviewer liked the paper, apparently a lot: “I very much like the assumption of neutrality, and I think this provocative idea deserves publication.”  The same reviewer gave a number of useful and critical suggestions for improving the manuscript.  Another reviewer had a very strong negative reaction to the paper. Here was the central concern: “I feel that the authors’ conclusion that the processes shaping gene content in bacteria and primarily neutral are significantly false, and potentially confusing to readers who do not appreciate the lack of a good fit between predictions and data, and who do not realise that the U-shaped distributions observed would be expected  under models where it is selection that determines gene number.”  There was no disagreement over the method or the analysis.  The disagreement was one of what our message was.

I still am not sure how this confusion arose, because throughout our first submission and our final published version, we were clear that the point of the manuscript was to show that the U-shape of gene frequency distributions provides less information than might have been thought/expected about selection. Such distributions are relatively easy to fit with a suite of null models. Again, Koonin’s quote is very apt here, but at some basic level, we had an impasse over a philosophy of the type of science we were doing. Moreover, although it is clear that non-neutral processes are important, I would argue that it is also incorrect to presume that all genes are non-neutral. There’s lots of evidence that many transferred genes have little to no effect on fitness. We revised the paper, including and solving alternative models with fixed and flexible core genomes, again showing that U-shapes are rather generic in this class of models. We argued our point, but the editor sided with the negative review, rejecting our paper in November after resubmission in September, with the same split amongst the reviewers.

Hence, we resubmitted the paper to Genome Biology, which rejected it at the editorial level after a delay of a few weeks without much of an explanation, and at that point, we decided to return to BMC Genomics, which we felt had been a good home for our first paper in this area and would likely make a good home for the follow-up. A colleague once said that there should be an r-index, where r is the number of rejections a paper received before ultimate acceptance. He argued that r-indices of 0 were likely not good (something about if you don’t fall, then you’re not trying) and an r-index of 10 was probably not good either. I wonder what’s right or wrong. But I’ll take an r of 2 in this case, especially because I felt that the PNAS review process really helped to make the paper better even if it was ultimately rejected. And, by submitting to Genome Biology, we were able to move quickly to another journal in the same BMC consortium.

Upcoming plans
Bart Haegeman and I continue to work on this problem, from both the theory and bioinformatics side. I find this problem incredibly fulfilling. It turns out that there are many features of the model that we still have not fully investigated. In addition, calculating gene frequency distributions involves a number of algorithmic challenges to scale up to large datasets. We are building a platform to help, to some extent, but are looking for collaborators who have specific algorithmic interests in these types of problems. We are also in discussions with biologists who want to utilize these types of analysis to solve particular problems, e.g., how can the analysis of gene frequency distributions be made more informative with respect to understanding the role of genes in evolution and the importance of genes to fitness? I realize there are more such models out there tackling other problems in quantitative population genomics (we cite many of them in our BMC Genomics paper), including some in the same area of understanding the core/pan genome and gene frequency distributions. I look forward to learning from and contributing to these studies.