Guest post by Josh Weitz: IP vs. PI-s: On Intellectual Property and Intellectual Exchange in the Sciences and Engineering as Practiced in Academia

Today I am pleased to publish this guest post from my friend and colleague Joshua Weitz. He does some fantastically interesting research but that is not what he is writing about here. Instead he is focusing on intellectual property and sabbaticals …

Joshua S. Weitz
School of Biology and School of Physics
Georgia Institute of Technology


This blog entry is inspired by a recent personal experience. However, I believe it serves to illustrate a larger issue that requires open discussion among faculty, administrators, and the tech transfer offices of research universities.


I am temporarily based at the U of Arizona (hereafter UA) working in Matthew Sullivan’s group where I am on a 9 month leave from my home institution, Georgia Tech (hereafter GT), spanning the term: August 15, 2013 to May 15, 2014. I arrived on campus and began the process of signing various electronic forms to allow me to use the campus network, get keys, etc.

The gateway to this process was a request to sign a “Designated Campus Colleague (DCC)” agreement: The DCC agreement is the official way the UA establishes an “association” with long-term visitors who do not receive any form of payment from UA (as applies in my case). I would be an Associate of the Department of Ecology and Evolutionary Biology, where the Sullivan group is based. The privileges of the DCC include receiving a UA email, keys to an office, a “CatCard” (Cat = short for Wildcats, the UA mascot) which enables me to access the building after hours, as well as other discounts. The DCC makes it quite clear that I am not an employee and that I will receive no form of payment whatsoever from UA. Indeed, all of this is rather immaterial to my real reason for spending 9 months here – to work in closer collaboration with the Sullivan group on problems of mutual interest. The obligations I must affirm are largely boilerplate (i.e., do no harm) but do include the following identified in Clause 11 of the DCC:

“11. INTELLECTUAL PROPERTY – Associate hereby assigns to the ABOR all his or her right, title and interest to intellectual property created or invented by Associate in which the ABOR claims an ownership interest under its Intellectual Property Policy (the “ABOR IP Policy”). Associate agrees to promptly disclose such intellectual property as required by the ABOR IP Policy, and to sign all documents and do all things necessary and proper to effect this assignment of rights. Associate has not agreed (and will not agree) in consulting or other agreements to grant intellectual property rights to any other person or entity that would conflict with this assignment or with the ABORs’ ownership interests under the ABOR IP Policy”

This clause is worth reading twice. I did. I then refused to sign the agreement. This clause, my refusal to sign the agreement, and what happened next are the basis for this blog entry.

Note that the template upon which my particular agreement is based can be found on the UA website. For those unfamiliar with intellectual property (IP) clauses that universities routinely request visitors to sign, here is a small (random) sample:

  1. U of Texas
  2. U of Maryland Baltimore
  3. Emory University
  4. Washington University @ St. Louis
  5. Northwestern University

It would be possible to collect many more such agreements that apply to both employees and to visitors of research universities.

The issues
The central issue at stake is one of ownership of future intellectual property. Presumably, a visitor has decided to visit U of X because U of X is the best in the world and because new, monetizable discoveries will emerge as a direct result of the “resources” at U of X. This may be true. Or it may not be. However, rather than try to chase IP after it has been created (a tenuous legal position), universities would rather have all visitors assign future IP immediately upon arrival and sort it out later (if and when IP is generated).

This language of “hereby assigns” (as quoted above) is not chance legalese. To the contrary, it is clear that the legal staff and upper administration at universities are concerned about the repercussions of Stanford vs. Roche. If you haven’t read up on this case, I encourage you to do so. The key take-away message is that the Supreme Court ruled that inventions remain the property of the inventor, to assign as they see fit, unless the inventor assigns them to a specific entity (e.g., an employer). As the U of California memo to faculty makes clear, the way that many universities want to deal with this problem moving forward is to reword their employment contracts so that employees immediately assign their IP to their university, rather than “agree to assign” their IP. “Agree to assign” (a clause describing a potential future action) was the language found in Stanford’s prior IP agreement and was one key reason Stanford lost its Supreme Court case to Roche, to the disappointment of Stanford, other major research universities, and the federal government.

Figure 1 – Supreme Court decision on Stanford vs. Roche; remainder available here.

However, scientific visits to other institutions create a complex contradiction whereby institutions ask their visitors to sign agreements that they would never want their own employees to sign while visiting other institutions!

These agreements make perfect sense if only one university retains aggressive IP language (i.e., it grabs all the IP of visitors who spend time on its campus and gives none away when its own employees become visitors elsewhere). Of course, none of this makes sense once all (or many) institutions adopt such policies, as is now the case. For example, how does someone from the U of California system visit UA? Or, as I asked myself, how can I “hereby assign” IP rights to the Board of Regents of Arizona when I have already assigned them to the Board of Regents of Georgia? Moreover, how can I affirm that I have “not agreed (and will not agree) in consulting or other agreements to grant intellectual property rights to any other person or entity that would conflict with this assignment”? Obviously, any faculty visitor already has a prior IP agreement with their employer! What should be done? Well, before I make some recommendations, let me point out that not all visits are the same. Instead of trying for a single blanket IP policy (the approach taken by most companies and apparently co-opted by academia), consider the following scenarios.

A few scenarios

  • Case 1: The visitor learning techniques. Many experimental groups routinely welcome visitors to learn new techniques and practice established techniques on new equipment. In many cases, such visits have a dual benefit: first, the visitor (often a student) learns a new method that they can apply to their research problem; second, the host helps to spread the use of some technique or method that they may have had a hand in establishing. There is a strong degree of cooperation here, one that is common in many, but not all, branches of science and engineering. The objective of such visits is not to perform a key experiment or test but to learn the basic steps that could enable their own advances at some future point. 
  • Case 2: The visitor performing research experiments. In some instances, collaboration requires visits to a peer institution. Such visitors may stay for short-term (e.g., ≤ 1 week) or longer periods. The purpose is to generate new scientific data that may, in turn, represent novel IP. The performance of such experiments requires some resource at the host institution; however, it is almost certainly the case that resources (whether personal or material) are also being contributed by the visitor.
  • Case 3: The short-term collaborative visitor. Scientists and engineers routinely visit each other. Why? Perhaps because they like talking about science with their peers, or because they like learning about what is happening at other institutions, or because they like talking to (and recruiting) students at other institutions, and in many cases, all of the above! Visits may range from 1-2 days (while giving a seminar) to a week (while spending focused time visiting a group that may kickstart a scientific collaboration). I have used the 1-week threshold rather arbitrarily, but it is helpful to classify visitors as either short-term (≤ 1 week) or long-term (> 1 week).
  • Case 4: The long-term collaborative visitor. Similar to Case 3, albeit on a longer scale, typically associated with sabbatical visits. Such visits need not involve any use of University resources in the sense that the visitor does not conduct experiments, does not utilize lab equipment, resources, reagents, etc. The purpose of such visits may be to experience a distinct intellectual environment, to stimulate a long-term collaboration, and/or to embark on a new research direction.

Before specifying recommendations that address these cases, I simply want to point out that these are different cases. Trying to treat all visitors equally with respect to IP conflicts with the spirit of open scientific exchange and reflects poorly on the extent to which the drafters of such policy appreciate what takes place during scientific visits. Hence, let me pause and ask you, the reader, to consider: what sort of guidelines would you recommend for each of these cases? And, while you’re busy considering that question, let me also propose a unified equation to try to shed some light on factors that suggest such IP policies are both self-contradictory and self-defeating.

An illustrative “equation” of the costs and benefits of aggressive requests for control of the IP of institutional visitors

I propose the following equation to quantify the total amount of money generated by a scientific visitor to a host institution:

$ = P * M * F


P = Probability of the invention taking place, as a direct result of the scientific visit
M = the Monetary value of the invention, accrued over its lifetime
F = Fraction of the dollar value assigned to the host institution.
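To make the trade-off concrete, here is a toy calculation in Python. All numbers below are entirely hypothetical illustrations of my own, not data: the point is simply that an aggressive policy can capture the whole Fraction yet still yield less expected revenue than a permissive one, if it suppresses the Probability enough.

```python
# Toy illustration of $ = P * M * F (all parameter values are made up).

def expected_revenue(p_invention, monetary_value, fraction_to_host):
    """Expected host-institution revenue from a single scientific visit."""
    return p_invention * monetary_value * fraction_to_host

# Permissive policy: visits are frictionless, so an invention is more
# likely (higher P), but the host claims only a negotiated share (F < 1).
permissive = expected_revenue(p_invention=0.02,
                              monetary_value=1_000_000,
                              fraction_to_host=0.3)

# Aggressive policy: the host claims everything (F = 1.0), but the
# friction deters collaboration, lowering P.
aggressive = expected_revenue(p_invention=0.005,
                              monetary_value=1_000_000,
                              fraction_to_host=1.0)

print(f"permissive: ${permissive:,.0f}")   # 0.02  * 1e6 * 0.3 = $6,000
print(f"aggressive: ${aggressive:,.0f}")   # 0.005 * 1e6 * 1.0 = $5,000
```

Under these (invented) numbers, grabbing 100% of the IP still loses money relative to the friendlier policy, because the deterrent effect on P dominates.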

To the extent that university IP policies for visitors are aimed at increasing incoming revenue, it is worthwhile to examine the effect such policies have on each component. First, how do policies affect P, i.e., the probability of the invention? Second, do they affect the Monetary value? Finally, how do they affect the Fraction of accrued revenue generated by visitor-initiated IP? Of note: it is clear that when a university asks a visitor to “hereby assign” IP rights, it is trying to position itself favorably with respect to the 2011 Supreme Court ruling in the Stanford vs. Roche case. In other words, universities don’t want to become another Stanford and lose out on potentially significant inventions. However, the key word is potentially.

P: I think it fair to say that such policies are likely to have negative effects on the P component of this equation. If a visitor knows that a visit to another institution is likely to involve giving up IP, creating conflicts with the IP policy of their own institution, or wasting many hours in trying to resolve such conflicts, are they more or less likely to spend the sort of time and energy necessary to create the IP in the first place? I am quite happy to visit the Sullivan group, but certainly far less happy to be spending time on visitor agreements (although I hope this blog entry may be of service to others). Perhaps others feel similarly.

M: I don’t know whether such policies have a neutral or negative effect on the Monetary scale of invention (and its future monetization). But I can’t imagine such IP policies act to increase the monetary scale of an invention for which the inventor gives up all claim to an entity other than his/her employer, without surety of compensation! Moreover, whatever monetary return to the university might ensue, such agreements are also likely to generate legal costs due to the conflicts they create, thereby decreasing the return on the invention.

F: The IP clauses claim to do an effective job of increasing the Fraction of revenue a host institution will acquire. Indeed, they claim to protect 100% of the IP-related revenue. However, given that any such clause almost certainly contradicts the visitor’s agreement with their own employer, it is not so clear which clause of which agreement takes precedence.

In summary, I hope this formula and discussion have shed some light on the costs and benefits of aggressive IP policies. I contend that aggressive IP policies act to increase one aspect of potential revenue but are likely to decrease two other aspects. Whether the cumulative effect is positive or negative remains a topic for someone else’s blog (or study). In any case, the most likely outcome of visits is not necessarily IP but ideas. These ideas are much more likely to be shared in the public domain and perhaps even become the basis for collaborative research proposals to governmental and non-governmental funding groups. If, at the end of the day, the university administration is counting $, irrespective of where it is generated, then it would be far more sensible to generate policies that do not favor one type of revenue stream (licensing derived from IP) over another (direct and indirect costs generated from grants). To reiterate: I suggest that aggressive IP policies are likely to negatively affect the probability that interactions occur that lead to monetizable ideas in the first place. Of course, such IP clauses may also have negative effects on the pursuit of knowledge, generation of knowledge, etc. (but wait, that was never the issue anyway).

I am not a lawyer. As such, the recommendations I lay out should not be mistaken for legal advice. Rather, I hope they are viewed as a few practical guidelines to avoid creating legal impossibilities and, in the process, diminishing rather than augmenting the likelihood that meaningful scientific exchange leads to the type of knowledge that will benefit society at large (and, in some cases, stimulate revenues for institutions and individuals for whom this is important). These recommendations are meant to apply to instances where visitors from other US universities or the government do not receive a salary/stipend/benefits and therefore are not employees. Cases where a visitor is paid a stipend/salary/benefits suggest a form of employee-employer relationship that involves distinct contractual obligations. Cases of visitors from industry and/or foreign institutions may require separate treatment.

1. Clauses regarding partial/full assignment/protection of IP should not be considered a default standard in visitor agreements unless (i) the visitor will perform experiments and/or directly utilize laboratory equipment, reagents and materials; or (ii) the visitor will discuss, view, or in any way interact with proprietary information/materials owned by the university. Identifying such cases could be addressed via a simple online questionnaire when establishing the agreement.

2. In the event that language regarding assignment/protection of IP is necessary, such clauses in visitor agreements should not attempt to take primacy over the IP assignment that the visitor almost certainly has signed with their employer. Rather, they should specify the host institution’s rights to ownership of its own employees’ portion of any IP generated as a result of the collaborative visit. At the end of the day, if IP is generated, then the scientists and engineers involved at both institutions will either come to a satisfactory division of percentage stakes or not. Technology transfer offices at major research universities routinely interact with each other and, I imagine, would be receptive to such a collaborative approach that they already routinely practice, irrespective of the rules on the books that may have emerged from other administrative offices.
Thankfully, I believe that an excellent model for the default IP clause in visitor agreement already exists! It is part of the Visitor Agreement to the Kavli Institute of Theoretical Physics at UC Santa Barbara. I have highlighted the clauses that I think should become the new default:

“INTELLECTUAL PROPERTY – To the extent legally permissible and subject to any overriding UC obligations to third parties, your home institution may retain ownership of any patentable inventions or copyrightable software you may develop during your work at KITP while participating in a KITP program as long as: (1) you will be visiting KITP for less than one year; (2) you do not need to be entered into UCSB’s official payroll system; (3) any financial support provided to you by KITP is for travel, living expenses and other similar costs and not to support direct research activities/projects (participation in the KITP program is not considered a direct research activity); and, (4) your activities occur in KITP facilities, only. Please note that if you engage in any research-related activities outside of KITP facilities and programs, the UC’s standard intellectual property policies, which require the UC to own intellectual property developed by visitors using UC facilities, will apply”

Figure 2 – KITP, indeed why would you go elsewhere on campus if you were based here!

The benefit of such a clause is that it assumes the default mode of a visit is scientific collaboration. That is absolutely correct! Further, it does not try to replace the established agreement of visitors with their home institutions. Indeed, other institutions have tried (to some extent) to address this point, e.g., Stanford University, which explicitly provides an alternative agreement if a prior employer agreement is already in place: SU 18A. A simple questionnaire could be used to help identify cases that require further discussion. The key here is simplicity, since creating yet another bureaucratic layer of complexity to visits is not what the scientific community wants or needs. Indeed, a final recommendation should be:

3. IP agreements should not become an element of short-term visits where laboratory access is not needed (i.e., as part of seminars, symposia, mini-conferences, etc.).

Of course, some might argue: all of this is moot since universities should not be in the business of generating, retaining and fighting for IP that is created with tax dollars, but should give away access to all published inventions. That perspective is important and discussed extensively elsewhere. However, the issue at stake is that for some institutions, IP related revenue is a significant portion of income and this is a likely driver of the increasingly aggressive visitor IP policies that are unlikely to disappear.

In closing, I hope that this entry stimulates some discussion and perhaps even productive conversations to minimize the extent to which IP clauses act at cross-purposes with the visits of Principal Investigators, postdocs and students between peer institutions.


All parties, both at UA and GT, have been incredibly helpful and highly sympathetic as I explained my rationale for taking issue with the IP clause of my visitor agreement. Perhaps they were also surprised that I read the agreement. In the interest of expediting the process, I handed over my case to the appropriate representatives at GT and at UA. After some discussion, they found a way forward. Recall that the original IP clause in my visitor agreement includes a reference to claims of ownership under the “ABOR IP Policy”. The relevant ABOR IP policy claims ownership of two types of intellectual property:

  1. “Any intellectual property created by a university or Board employee in the course and scope of employment, and
  2. Any intellectual property created with the significant use of Board or university resources, unless otherwise provided in an authorized agreement for the use of those resources.”

The terms of my visitor agreement make it clear that I am not an employee; hence, case 1 does not apply. Second, because of my situation as a theoretician, I do not plan on using any UA laboratories, equipment, materials, etc. Moreover, all of my computational research will continue to be conducted on GT-owned or personally owned computers. Hence, the only “resources” I plan to utilize are: (i) an office; and (ii) the internet. Both UA and GT agree that such activities do not cross the threshold for “significant use” of resources; hence, case 2 does not apply. So long as my use does not change, UA should not have standing to claim any IP, and my signing of this agreement would not conflict with my employment contract at GT. This understanding now has a paper trail.

As of mid-September 2013 (one month or so after my initial refusal to sign the DCC), I have signed the agreement and am now officially a Visiting Professor in the Department of Ecology and Evolutionary Biology at the University of Arizona.

Guest post from Joshua Weitz: Talking about the PI Sabbatical Beforehand: A Brief Guide for Faculty, Postdocs, and PhD Students in the Sciences

Guest Post by Joshua Weitz, Associate Professor, School of Biology and School of Physics, Georgia Institute of Technology


I direct a theoretical ecology and quantitative biology group based in the School of Biology at Georgia Tech.  I am going on a 9 month “leave” (Georgia Tech does not call them sabbaticals) to the Department of Ecology and Evolutionary Biology at the U of Arizona in Tucson, AZ from August 2013-May 2014 where I will be based in Matthew Sullivan’s Tucson Marine Phage Laboratory.  In preparation for this leave, our group held an interactive discussion on challenges and opportunities arising from the PI sabbatical for faculty, postdocs and PhD students in the sciences.  The discussion took place in four parts in a one-hour period.  Below I describe the setup of the discussion followed by specific recommendations for faculty, postdocs and PhD students prior to the PI sabbatical. 
How to Talk about the PI Sabbatical
Part 1 – the setup: I asked for a show of hands of group members who had thought about how my sabbatical would change the group and its dynamics.  Nearly all members raised their hands.  When asked, the group members also noted that they were most concerned about how the sabbatical would affect them.  Hence, in an effort to understand the effect of the sabbatical on all members, we split into three small discussion groups which were asked to identify challenges and opportunities for (i) the PI; (ii) postdocs; (iii) PhD students.
Part 2 – small group discussion: The individual groups talked about how the sabbatical would affect different group members.  There are currently 9 members in the group (not including the PI), so we divided into three groups of three (I did not actively participate in the small group discussions, but did check in on all three groups).  The PI group consisted of one postdoc, one graduate student, and one undergraduate.  The postdoc group had one postdoc and two graduate students.  The grad student group had three graduate students.  Hence, the first item of discussion was an effort to identify issues at stake at each career stage.  Then, the groups began to discuss how the sabbatical might change business as usual.  The groups spoke for ~15 minutes.
Part 3 – reporting: Challenges and opportunities were identified for each of the three categories.  A number of salient themes emerged that serve as general recommendations.  The consensus was that these common themes would arise prior to any sabbatical, although each research group may differ in its own interactions.  Our presumption was that the PI was going alone, and this shaped the nature of our recommendations.  First, as suggested by one of the students, there was a sense that the PI sabbatical would put the students in a “Spiderman situation”, in the sense that “with great power comes great responsibility”.  The PI sabbatical would lead to greater independence for group members, and this independence involves a greater need for self-motivation, taking a holistic (long-term) view of one’s research, and increased pre-planning given the changes to the PI’s availability.  Second, clear communication is essential.  For example, if a PI plans to be incommunicado for long stretches of time, this may be manageable (even if non-ideal from a student perspective), so long as provisions have been made to handle both the administrative and research duties that the PI normally would handle.  As a rule of thumb, the greater the change in PI availability, the greater the need for pre-planning to ensure that students and postdocs remain on track for research, career and personal development goals.
Part 4 – the view of the PI:  I provided additional feedback, tailored to the group and specifically addressed an issue that could create the most anxiety: my availability for one-on-one interactions.  I also distributed an initial recommendation list, modified here in light of group discussion. 

Five Specific pre-Sabbatical Recommendations for Faculty, Postdocs and PhD Students

Faculty
1.     Develop a plan for your year ahead: what are the key goals for the sabbatical?
2.     Identify what is going to be different and what is going to remain the same: e.g., a new project(s), less/no teaching; less/no administrative duties, a new interaction schedule with the group, etc.
3.     Communicate your plans for the sabbatical and your expectations of group members to the group (ideally, after a group discussion of the kind outlined here).
4.     Talk to your Chair about expectations for your year and new expectations (if any) upon your return (and talk to your departmental admin team to make sure they are aware of your plans).
5.     Establish new interactions with your local host and host community.
Postdocs
1.     Establish a regular schedule of interactions with your adviser.
2.     Keep focused on your research & career goals (i.e., do not become a proxy adviser in the absence of the PI; see 3 & 4 below)
3.     Determine your supervisory responsibilities – what is your (limited) role to advise the students, technicians in the group?
4.     Determine your lab management responsibilities – what is your (limited) role in ordering and other admin duties?
5.     Travel to collaborators and mentors – do not just stay put while your adviser is away.
PhD students
1.     Identify the major research and career development expectations during your adviser’s time away – how will the adviser’s absence affect your thesis (if at all)?
2.     Establish a regular schedule of interactions with your adviser and senior members of the group.
3.     Contact your adviser, even off-schedule if you really need advice.
4.     Remember: your PI’s sabbatical is an opportunity for independence, increased self-motivated work and development as a scholar, not a “holiday”.
5.     Identify a local faculty member who can serve as an occasional resource to provide input and thoughts on your thesis work (this should be coordinated in advance, with your PI).
Final thoughts:
A PI sabbatical can be a very positive opportunity for all group members to become more independent, to set off on new directions, and to bring greater creativity and productivity to a group.  However, two notes of caution.  First, if you are not yet in a group, think carefully before joining with an absent PI, as the initial period in a group (regardless of your status) often sets the frame for the long-term interaction.  Second, the PI remains the PI, so be wary of a sabbatical plan that involves anyone other than the PI becoming the acting group leader.   Although certain senior members may take over duties, the sabbatical plan should (ideally) involve availability of the PI to make key decisions critical to the group, including thesis advancement, hiring/firing and mediation of major conflicts.

And, I suppose I’ll have to revisit this guide next year to report back on what worked and what we should have thought of in advance!


  •  Dr. Joshua Weitz
  •  Dr. Alexander Bucksch
  •  Dr. Michael Cortez
  •  Abhiram Das, PhD candidate
  •  Cesar Flores, PhD candidate
  •  Luis Jover, PhD candidate
  •  Gabriel Mitchell, PhD candidate
  •  Bradford Taylor, PhD candidate
  •  Charles Wigington, PhD candidate
  •  Victoria Chou, NSF REU student
Further reading
Many blogs are available detailing sabbatical “adventures” and “diaries”.

Guest post from Kimmen Sjölander about FAT-CAT phylogenomics pipeline

Below is a guest post from my friend and colleague Kimmen Sjölander, Prof. at UC Berkeley and phylogenomics guru. 

Announcing the FAT-CAT phylogenomic annotation webserver.

FAT-CAT is a new web server for phylogenomic prediction of function, ortholog identification, and taxonomic origin prediction of metagenome sequences, based on HMM-based classification of protein sequences to >93K pre-calculated phylogenetic trees in the PhyloFacts database. PhyloFacts is unique among phylogenomic databases in having both broad taxonomic coverage – more than 7.3M proteins from >99K unique taxa across the Tree of Life, including targeted coverage of genomes from Eukaryotes, Bacteria and Archaea – and integrated functional data on trees for Pfam domains and multi-domain architectures. PhyloFacts trees include functional and annotation data from UniProt (SwissProt and TrEMBL), GO, BioCyc, Pfam, Enzyme Commission and other sources. The FAT-CAT pipeline uses HMMs at all nodes of PhyloFacts trees to classify user sequences to different levels of functional hierarchies, based on the subtree HMM giving the sequence the strongest score. Phylogenetic placements within orthology groups defined on PhyloFacts trees are used to predict function and to predict orthologs. Sequences from metagenome projects can be classified taxonomically based on the MRCA of the sequences descending from the top-scoring subtree node. Because of the broad taxonomic and functional coverage, FAT-CAT can identify orthologs and predict function for most sequence inputs. We’re working to make FAT-CAT less computationally intensive so that users will be able to upload entire genomes for analysis; in the interim, we limit users to 20 sequence inputs per day. Registered users are given a higher quota (see details online). We’d love to hear from you if you have feature requests or bug reports; please send any to Kimmen Sjölander – kimmen at berkeley dot edu (parse appropriately).
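As a rough sketch of the classification logic described above (this is not the actual FAT-CAT code; all function names, the keyword-count stand-in for HMM scoring, and the example lineages are hypothetical), the two key steps, picking the top-scoring subtree HMM and then taking the MRCA of the taxa under that node, might look like:

```python
# Schematic sketch of subtree-HMM classification + MRCA taxonomy.
# NOT the real FAT-CAT pipeline; names and data here are illustrative only.

def top_scoring_node(sequence, subtree_hmms):
    """Pick the subtree whose HMM gives the query the strongest score.

    `subtree_hmms` maps a subtree node id to a scoring function; in the
    real pipeline this would be an HMM score for the protein sequence.
    """
    return max(subtree_hmms, key=lambda node: subtree_hmms[node](sequence))

def mrca(taxon_lineages):
    """Most recent common ancestor: deepest rank shared by all lineages.

    Each lineage is a root-to-leaf list of ranks, e.g.
    ["Bacteria", "Proteobacteria", "Escherichia"].
    """
    common = []
    for ranks in zip(*taxon_lineages):
        if len(set(ranks)) == 1:
            common.append(ranks[0])
        else:
            break
    return common[-1] if common else "root"

# Hypothetical usage: two subtree "HMMs" approximated by residue counts.
hmms = {
    "node_A": lambda s: s.count("K"),   # stand-in for a real HMM score
    "node_B": lambda s: s.count("W"),
}
best = top_scoring_node("MKKWKK", hmms)
lineages = [
    ["Bacteria", "Proteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Salmonella"],
]
print(best)            # node_A (four K's beat one W)
print(mrca(lineages))  # Proteobacteria
```

The design point is simply that taxonomic placement falls out of the tree structure: once the best-scoring subtree is known, the query inherits the MRCA of the taxa beneath it.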

Guest Post on Viruses from Claudiu Bandea

From here.

Guest Post Today from Claudiu Bandea .

Claudiu wrote to me after my paper on “Stalking the Fourth Domain” came out.

He wrote


I posted a comment on your ‘PLoSOne paper’ blog, but I thought of sending you this mail. 

You might be interested in taking a look at the attached paper presenting a fusion model for the origin of ‘ancestral viruses’ from parasitic or symbiotic cellular species, and its implication for the evolution of viruses and cellular domains, which I’m attaching here (you can see the entire series, including comments, at:). Possibly, the novel sequences you discovered belong to such ‘transitional forms’ between the cellular domains and the viral domains.

I know it’s a lot of material, but you might want to focus on Fig. 4 and the related discussion about TOLs from the perspective of the current hypotheses on origin and evolution of viruses. Because of your interest in TOL, I want to ask your thoughts on the difference between the concept of TOL based on the line-of-descent, the ways it was historically intended, and the current approaches of using (mostly) sequences which, as you know, due to LGT might not necessarily reflect the line-of-descent relationships.



After a bit of back and forth, I offered to let him write a guest post on my blog about this. He accepted my offer.  I note – I am not endorsing any of his ideas here, and to be honest I have not read the papers he refers to – I have skimmed them and they seem interesting, but I have not had a chance to read them.  I also note – I am a bit uncomfortable with the fact that I cannot seem to find any Web Profile / Web Site / Blog / etc. with more detail about him and his work.  On one hand – ideas are ideas and they can and should stand on their own.  On the other hand, context is useful in many cases and I feel like I am missing some context here.  He works at the CDC but I am not sure what he actually does there.  But in the interest of open discussion of ideas and since, well, not having a web site is certainly not a crime, his post is below.

The most efficient way of silencing ideas is not by criticizing them but by pretending they don’t exist. The antidote might be the blogging world.

A couple of decades ago, I published a novel model on the evolutionary origin of ancestral viral lineages. Recently, I updated this model and integrated it into an ambitious unifying scenario on the origin and evolution of cellular and viral domains, including the origin of life; well, that might have just buried it so deep that it’s gone for good even for those with an open mind and noble intentions.

So, I would like to ask you the favor of reviewing and criticizing this model. As a primer, you might want to read a comment I posted last summer on a book review by Robin Weiss. The book was Carl Zimmer’s A Planet of Viruses and the review by Dr. Weiss, one of the most distinguished contemporary virologists, was entitled Potent Tiny Packages, which symbolizes our century-long perspective on the nature of viruses as virus particles. If we have reasons to call Earth a planet of viruses, as I think Carl successfully made the point, then viruses require our full attention, including the right to be correctly identified and to be included in the Tree of Life.

I know, this is a lot of material, but I hope you’ll find it interesting, and I would be thrilled to address your questions and listen to your ideas.

Guest post from Russell Neches @ryneches, PhD student in my lab "Blogging his qualifying exam"

Below is a guest post from Russell Neches, a PhD student in my lab.

Blogging my Qualifying Exam

Because this seems to be my default mode of organizing my thoughts when it comes to research, I’ve decided to write my dissertation proposal as a blog post. This way, when I’m standing in front of my committee on Thursday, I can simply fall back on one of my more annoying habits: talking at length about something I wrote on my blog. Or, since he has graciously lent me his megaphone for the occasion, I can talk at length about something I wrote on Jonathan’s blog.

Introduction : Seeking a microbial travelogue
Last summer, I had a lucky chance to travel to Kamchatka with Frank Robb and Albert Colman. It was a learning experience of epic proportions. Nevertheless, I came home with a puzzling question. As I continued to ponder it, the question went from puzzling to vexing to maddening, and eventually became an unhealthy obsession. In other words, a dissertation project. In the following paragraphs, I’m going to try to explain why this question is so interesting, and what I’m going to do to try to answer it.
About a million years ago (the mid-Pleistocene), one of Kamchatka’s many volcanoes erupted and collapsed into its magma chamber to form Uzon Caldera. The caldera floor is now a spectacular thermal field, and one of the most beautiful spots on the planet. I regularly read through Igor Shpilenok’s Livejournal, where he posts incredible photographs of Uzon and the nature reserve that encompasses it. It’s well worth bookmarking, even if you can’t read Russian.

The thermal fields are covered in hot springs of many different sizes. Here’s one of my favorites :

Each one of these is about the size of a bowl of soup. In some places the springs are so numerous that it is difficult to avoid stepping in them. You can tell just by looking at these three springs that the chemistry varies considerably; I’m given to understand that the different colors are due to the dominant oxidation species of sulfur, and the one on the far left was about thirty degrees hotter than the other two. All three of them are almost certainly colonized by fascinating microbes.
The experienced microbiologists on the expedition set about the business of pursuing questions like Who is there? and What are they doing? I was there to collect a few samples for metagenomic sequencing, and so my own work was completed on the first day. I spent the rest of my time there thinking about the microbes that live in these beautiful hot springs, and wondering How did they get there?

Extremophiles are practically made-to-order for this question. The study of extremophile biology has been a bonanza for both applied and basic science. Extremophiles live differently, and their adaptations have taught us a lot about how evolution works, about the history of life on earth, about biochemistry, and all sorts of interesting things. However, their very peculiarity poses an interesting problem. Imagine you would freeze to death at 80° Celsius. How does the world look to you? Pretty inhospitable; a few little ponds of warmth dotted across vast deserts of freezing death.
Clearly, dispersal plays an essential role in the survival and evolution of these organisms, yet we know almost nothing about how they do it. The model of microbial dispersal that has reigned supreme in microbiology since it was first proposed in 1934 is Lourens Baas Becking’s, “alles is overal: maar het milieu selecteert” (everything is everywhere, but the environment selects). This is a profound idea; it asserts that microbial dispersal is effectively infinite, and that differences in the composition of microbial communities are due to selection alone. The phenomenon of sites that seem identical but have different communities is explained as a failure to understand and measure their selective properties well enough.

This model has been a powerful tool for microbiology, and much of what we know about cellular metabolism has been learned by the careful tinkering with selective growth media it exhorts one to conduct. Nevertheless, the Baas Becking model just doesn’t seem reasonable. Microbes do not disperse among the continents by quantum teleportation; they must face barriers and obstacles, some perhaps insurmountable, as well as conduits and highways. Even with their rapid growth and vast numbers, this landscape of barriers and conduits must influence their spread around the world.

Ecologists have known for a very long time that these barriers and conduits are crucial evolutionary mechanisms. Evolution can be seen as an interaction of two processes: mutation and selection. The nature of the interaction is determined by the structure of the population in which they occur. This structure is determined by biological processes such as sexual mechanisms and recombination, which are in turn determined chiefly by the population’s distribution in space and its migration in that space.

As any sports fan knows, the structure of a tournament can be more important than the outcome of any particular game, or even the rules of the game. This is true for life, too. From one generation to the next, genes are shuffled and reshuffled through the population, and the way the population is compartmentalized sets the broad outlines of this process.

A monolithic population — one in which all players are in the same compartment — evolves differently than a fragmented population, even if mutation, recombination and selection pressures are identical. And so, if we want to understand the evolution of microbes, we need to know something about this structure. Baas Becking’s hypothesis is a statement about the nature of this structure, specifically, that the structure is monolithic. If true, it means that the only difference between an Erlenmeyer flask and the entire planet is the number of unique niches. The difference in size would be irrelevant.

This is a pretty strange thing to claim. And yet, the Baas Becking model has proved surprisingly difficult to knock down. For as long as microbiologists have been systematically classifying microbes, whenever they’ve found similar environments, they’ve found basically the same microbes. Baas Becking proposed his hypothesis in an environment of overwhelming evidence.

However, as molecular techniques have allowed researchers to probe deeper into the life and times of microbes (and every other living thing), some cracks have started to show. Rachel Whitaker and Thane Papke have challenged the Baas Becking model by looking at the biogeography of thermophilic microbes (such as Sulfolobus islandicus and Oscillatoria amphigranulata), first by 16S rRNA phylogenetics and later using high resolution, multi-locus methods. Both Rachel’s work and Papke’s work, as well as many studies of disease evolution, very clearly show that when you look within a microbial species, the populations do not appear quite so cosmopolitan. While Sulfolobus islandicus is found in hot springs all over the world, the evolutionary distance between each pair of its isolates is strongly correlated with the geographic distance between their sources. So, these microbes are indeed getting around the planet, but if we look at their DNA, we see that they are not getting around so quickly.

However, Baas Becking has an answer for this; “…but the environment selects.” What if the variation is due to selection acting at a finer scale?  It’s well established that species sorting effects play a major role in determining the composition of microbial communities at the species level. There is no particular reason to believe that this effect does not apply at smaller phylogenetic scales. The work with Sulfolobus islandicus attempts to control for this by choosing isolates from hot springs with similar physical and chemical properties, but unfortunately there is no such thing as a pair of identical hot springs. Just walk the boardwalks in Yellowstone, and you’ll see what I mean. The differences among the sites from which these microbes were isolated can always be offered as an alternative explanation to dispersal. Even if you crank those differences down to nearly zero, one can always suggest that perhaps there is a difference that we don’t know about that happened to be important.

This is why the Baas Becking hypothesis is so hard to refute: One must simultaneously establish that there is a non-uniform phylogeographic distribution, and that this non-uniformity is not due to selection-driven effects such as species sorting or local adaptive selection. To do this, we need a methodology that allows us to simultaneously measure phylogeography and selection.

There are a variety of ways of measuring selection. Jonathan’s Evolution textbook has a whole chapter about it. I’ll go into a bit more detail in Aim 3, but for now, I’d just like to draw attention to the fact that the effect of selection does not typically fall uniformly across a genome. This non-uniformity tends to leave a characteristic signature in the nucleotide composition of a population. Selective sweeps and bottlenecks, for example, are usually identified by examining how a population’s nucleotide diversity varies over its genome.

For certain measures of selection (e.g., linkage disequilibrium) one can design a set of marker genes that could be used to assay the relative effect of selection among populations. This could then extend the single species, multi-locus phylogenetic methods that have already been used to measure the biogeography of microbes to include information about selection. This could, in principle, allow one to simultaneously refute “everything is everywhere…” and “…but the environment selects.” However, designing and testing all those markers, ordering all those primers and doing all those PCR reactions would be a drag. If selection turned out to work a little differently than initially imagined, the data would be useless.

But, these are microbes, after all. If I’ve learned anything from Jonathan, it’s that there is very little to be gained by avoiding sequencing.

We’re getting better and better at sequencing new genomes, but it is not a trivial undertaking. However, re-sequencing genomes is becoming routine enough that it’s replacing microarray analysis for many applications. The most difficult part of re-sequencing an isolate is growing the isolate. Fortunately, re-sequencing is particularly well suited for culture-independent approaches. As long as we have complete genomes for the organisms we’re interested in, we can build metagenomes from environmental samples using our favorite second-generation sequencing platform. Then we simply map the reads to the reference genomes. The workflow is a bit like ChIP-seq, except without culturing anything and without the ChIP. We go directly from the environmental sample to sequencing to read-mapping. Maybe we can call it Eco-seq? That sounds catchy.

Not only is the whole-genome approach better, but with the right tools, it is easier and cheaper than multi-locus methods, and allows one to include many species simultaneously. The data will do beautifully for phylogeography, and have the added benefit that we can recapitulate the multi-locus methodology by throwing away data, rather than collecting more.

To implement this, I have divided my project into three main steps :

  • Aim 1 : Develop a biogeographical sampling strategy to optimize representation of a natural microbial community
  • Aim 2 : Develop and apply techniques for broad metagenomic sampling, metadata collection and data processing
  • Aim 3 : Test the dispersal hypothesis using a phylogeographic model with controls for local selection
But, before I get into the implementation, I should pause for a moment and make sure I’ve stated my hypothesis perfectly clearly : I think that dispersal plays a major role in the composition of microbial communities. The Baas Becking hypothesis doesn’t deny that dispersal happens; in fact, it asserts that dispersal is infinite, but that it is selection, not dispersal, that ultimately determines which microbes are found in any particular place. If I find instead that dispersal itself plays a major role in determining community composition, then the world is a very different place to be a microbe.
Aim 1 : Develop a biogeographical sampling strategy to optimize the representation of a complete natural community

While I would love to keep visiting places like Kamchatka and Yellowstone, I’ve decided to study the biogeography of halophiles, specifically in California and neighboring states. Firstly, because I can drive and hike to most of the places where they grow. Secondly, because the places where halophiles like to grow tend to be much easier to get permission to sample from. Some of them are industrial waste sites; no worry about disturbing fragile habitats. Thirdly, because our lab has been heavily involved in sequencing halophile genomes, which are a necessary component of my approach. There is also a fourth reason, but I’m saving it for the Epilogue.

As I have written about before, the US Geological Survey has built a massive catalog of hydrological features across the Western United States. It’s as complete a list of the substantial, persistent halophile habitats one could possibly wish for. It has almost two thousand possible sites in California, Nevada and Oregon alone :
USGS survey sites. UC Davis is marked with a red star.

The database is complete enough that we can get a pretty good sense of what the distribution of sites looks like within this region just by looking at the map. The sites are basically coincident with mountain ranges. Even though they aren’t depicted, the Coastal Range, the Sierras, the Cascades and the Rockies all stand out. This isn’t surprising; salt lakes require some sort of constraining topography, or the natural drainage would simply carry the salt into the ocean. Interestingly, hot springs are also usually found in mountains (some of these sites are indeed hot springs), but that has less to do with the mountains themselves than with the processes that built them. To put it more pithily, you find salt lakes where there are mountains, but you find mountains where there are hot springs.
This database obviously contains too many sites to visit. It took Dr. Mariner’s team forty years to gather all of this information. I need to choose from among these sites. But which ones? Is there a way to know if I’m making good selections? Does it even matter? 
As it turns out, it does matter. When we talk about dispersal in the context of biogeography, we are making a statement about the way organisms get from place to place. Usually, we expect to see a distance decay relationship, because we expect that more distant places are harder to get to, and thus the rates of dispersal across longer distances should be lower. I need to be reasonably confident that I will see the same distance-decay relationship within the sub-sample that I would have seen for every site in the database. This doesn’t necessarily mean that the microbes will obey this relationship, but if they do, I need data that would support the measurement.
There is a pretty straightforward way of doing this. If we take every pair of sites in the database, calculate the Great Circle distance between them, and then sort these distances, we can get a spectrum of pairwise distances. Here’s what that looks like for the sites in my chunk of the USGS database :

The spectrum of pairwise distances among all sites in the USGS database (solid black), among randomly placed sites over the same geographic area (dashed black), and among a random sub-sample of 360 sites from the database (solid red).
I’ve plotted three spectra here. The dashed black line is what you’d get if the sites had been randomly distributed over the same geographic area, and the solid black line is the spectrum of the actual pairwise distances. As you can see, the distribution is highly non-random, but we already knew this just by glancing at the map. The red line is the spectrum of a random sub-sample of 360 sites from the database (I chose 360 because that is about how many samples I could collect in five one-week road trips).
This sub-sample matches the spectrum of the database pretty well, but not perfectly. It’s easy to generate candidate sub-samples, and they can be scored by how closely their spectra match the database. I’d like to minimize the amount of time it takes me to finish my dissertation, which I expect will be somewhat related to the number of samples I collect. There is a cute little optimization problem there.
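The site-selection idea can be sketched in a few lines of Python. This is a toy version with made-up coordinates rather than the actual USGS data or scoring code, and the quantile-matching score is just one reasonable way to compare a sub-sample's distance spectrum against the full database's:

```python
# Sketch of the site-selection idea: compare the spectrum of pairwise
# great-circle distances for a random sub-sample against the full set.
# Site coordinates here are synthetic; real ones would come from the
# USGS database.
import math
import random

def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def distance_spectrum(sites):
    """Sorted list of all pairwise great-circle distances."""
    return sorted(great_circle_km(sites[i], sites[j])
                  for i in range(len(sites)) for j in range(i + 1, len(sites)))

def spectrum_mismatch(sub, full):
    """Score a sub-sample: mean absolute difference between its spectrum
    and the full spectrum, compared at matching quantiles."""
    full_spec, sub_spec = distance_spectrum(full), distance_spectrum(sub)
    n = len(sub_spec)
    # sample the full spectrum at the sub-sample's quantiles
    ref = [full_spec[int(k * (len(full_spec) - 1) / (n - 1))] for k in range(n)]
    return sum(abs(s - r) for s, r in zip(sub_spec, ref)) / n

random.seed(1)
sites = [(random.uniform(32, 46), random.uniform(-124, -114)) for _ in range(200)]
sub = random.sample(sites, 40)
print(spectrum_mismatch(sub, sites))  # smaller score = better spectral match
```

Generating many candidate sub-samples and keeping the lowest-scoring one of a given size is exactly the "cute little optimization problem" described here.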
Although I’ve outlined the field work, laboratory work and analysis as separate steps, these things will actually take place simultaneously. After I return from the field with the first batch of samples, I will process and submit them for sequencing before going on the next collection trip. I can dispatch the analysis pipeline from pretty much anywhere (even with my mobile phone). That’s why I’ve set aside sample selection and collection as a separate aim. The sample selection process determines where to start, how to proceed, and when I’m done.

Aim 2 : Develop and apply techniques for broad metagenomic sampling, metadata collection and data processing

In order to build all these genomes, I need to solve some technical problems. Building this many metagenomes is a pretty new thing, and so some of the tools I need did not exist in a form (or at a cost) that is useful to me. So, I’ve developed or adapted some new tools to bring the effort, cost and time for large-scale comparative metagenomics into the realm of a dissertation project.

There are four technical challenges :

  • Quickly collect a large number of samples and transport them to the laboratory without degradation.
  • Build several hundred sequencing libraries.
  • Collect high-quality metadata describing the sites.
  • Assemble thousands of re-sequenced genomes.
To solve each of these problems, I’ve applied exactly the same principle : Simplify and parallelize. I can’t claim credit for the idea here, because I was raised on it. Literally.
Sample collection protocol
When I first joined Jonathan’s lab, Jenna Morgan (if you’re looking for her newer papers, make sure to add “Lang,” as she’s since gotten married) was testing how well metagenomic sequencing actually represents the target environment. In her paper, now out in PLoS ONE, one of the key findings is that mechanical disruption is essential. 
I learned during my trip to Kamchatka that getting samples back to the lab without degradation is very hard, and it really would be best to do the DNA extraction immediately. Unfortunately, another lesson I learned in Kamchatka is that it is surprisingly difficult to do molecular biology in the woods. One of the ways I helped out while I was there was to kill mosquitoes trying to bite our lab technician so she wouldn’t have to swat them with her gloved hands. It’s not easy to do this without making an aerosol of bug guts and blood over the open spin columns. 
So, I was very excited when I went to ASM last year, and encountered a cool idea from Zymo Research. Basically, it’s a battery-operated bead mill, and a combined stabilization and cell lysis buffer. This solves the transportation problem and the bead-beating problem, without the need to do any fiddly pipetting and centrifuging in the field. Also, it looks cool.

Unfortunately, the nylon screw threads on the sample processor tend to get gummed up with dirt, so I’ve designed my own attachment that uses a quick-release style fitting instead of a screw top.

It’s called the Smash-o-Tron 3000, and you can download it on Thingiverse.

Sequencing library construction

The next technical problem is actually building the sequencing libraries. Potentially, there could be a lot of them, especially if I do replicates. If I were to collect three biological replicates from every site on the map, I would have to create about six thousand metagenomes. I will not be collecting anywhere close to six thousand samples, but I thought it was an interesting technical problem. So I solved it.

Well, actually I added some mechanization to a solution Epicentre (now part of Illumina) marketed, and my lab-mates Aaron Darling and Qingyi Zhang have refined into a dirt-cheap multiplexed sequencing solution. The standard technique for building Illumina sequencing libraries involves mechanically shearing the source DNA, ligating barcode sequences and sequencing adapters to the fragments, mixing them all together, and then doing size selection and cleanup. The first two steps of this process are fairly tedious and expensive. As it turns out, Tn5 transposase can be used to fragment the DNA and ligate the barcodes and adapters in one easy digest. Qingyi is now growing huge quantities of the stuff.
The trouble is that DNA extraction yields an unpredictable amount of DNA, and the activity of Tn5 is sensitive to the concentration of target DNA. So, before you can start the Tn5 digest, you have to dilute the raw DNA to the right concentration and aliquot the correct amount for the reaction. This isn’t a big deal if you have a dozen samples. If you have thousands, the dilutions become the rate limiting step. If I’m the one doing the dilutions, it becomes a show-stopper at around a hundred samples. I’m just not that good at pipetting. (Seriously.)
The usual way of dealing with this problem is to use a liquid handling robot. Unfortunately, liquid handling robots are stupendously expensive. Even at their considerable expense, many of them are shockingly slow.
To efficiently process a large number of samples, we need to be able to treat every sample exactly the same. This way, we can bang through the whole protocol with a multichannel pipetter. It occurred to me that many companies sell DNA extraction kits that use spin columns embedded in 96-well plates, and we have a swinging bucket centrifuge with a rotor that accommodates four plates at a time. So, the DNA extraction step is easy to parallelize. The Tn5 digests work just fine in 96-well plates.
We happen to have (well, actually Marc’s lab has) a fluorometer that handles 96-well plates. Once the DNA extraction is finished, I can use a multichannel pipetter to make aliquots from the raw DNA, and measure the DNA yield for each sample in parallel. So far, so good.
Now, to dilute the raw DNA to the right concentration for the Tn5 digest, I need to put an equal volume of raw DNA into differing amounts of water. This violates the principle of treating every sample the same, which means I can’t use a multichannel pipetter to get the job done. That is, unless I have a 96-well plate that looks like this :

Programmatically generated dilution plate CAD model
I wrote a piece of software that takes a table of concentration measurements from the fluorometer, and designs a 96-well plate with wells of the correct volume to dilute each sample to the right concentration for the Tn5 digest. If I make one of these plates for each batch of 96 samples, I can use a multichannel pipetter throughout.
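The arithmetic behind that software can be sketched as follows. The aliquot volume, target concentration and headroom below are illustrative numbers (the real program emits a CAD model rather than a table); the point is just how each well's capacity is set from a fluorometer reading:

```python
# A minimal sketch of the dilution-plate calculation: given a fluorometer
# reading for each sample and a fixed aliquot of raw DNA, compute the well
# volume (DNA + water) that brings every sample to the same target
# concentration. All constants here are illustrative assumptions.

ALIQUOT_UL = 5.0        # fixed raw-DNA volume pipetted into every well
TARGET_NG_PER_UL = 2.5  # concentration the Tn5 digest wants
HEADROOM_UL = 10.0      # extra well capacity so nothing overflows

def well_volume_ul(conc_ng_per_ul):
    """Total liquid volume so that ALIQUOT_UL of raw DNA at the measured
    concentration ends up at TARGET_NG_PER_UL."""
    if conc_ng_per_ul < TARGET_NG_PER_UL:
        raise ValueError("sample too dilute to reach target by dilution")
    return ALIQUOT_UL * conc_ng_per_ul / TARGET_NG_PER_UL

def plate_design(concentrations):
    """Map well IDs (A1 ... H12) to printed well capacities in microliters."""
    wells = [f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)]
    return {w: well_volume_ul(c) + HEADROOM_UL
            for w, c in zip(wells, concentrations)}

# e.g. a sample measured at 25 ng/uL needs a 50 uL dilution
# (5 uL DNA + 45 uL water) to hit 2.5 ng/uL
print(well_volume_ul(25.0))  # 50.0
```

With the capacities in hand, every well just gets filled with water to the brim before the fixed aliquot goes in, so the multichannel pipetter can treat every sample identically.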
Of course, unless you are Kevin Flynn, you can’t actually pipette liquids into a 3D computer model and achieve the desired effect. To convert the model from bits into atoms, I ordered a 3D printer kit from Ultimaker. (I love working in this lab!)

The Ultimaker kit
After three days of intense and highly entertaining fiddling around, I managed to get the kit assembled. A few more days of experimentation yielded my first successful prints (a couple of whistles). A few days after that, I was starting my first attempts to build my calibrated volume dilution plates.

Dawei Lin and his daughter waiting for their whistle (thing 1046) to finish printing.
Learning about 3D printing has been an adventure, but I’ve got the basics down and I’m now refining the process, printing plates with surprisingly good quality. I’ve had some help from the Ultimaker community on this, particularly from Florian Horsch.

Much to my embarrassment, the first (very lousy) prototype of my calibrated volume dilution plate ended up on AggieTV. Fortunately, the glare from the window made it look much more awesome than it actually was.
The upshot is that if I needed to make ten or twenty thousand metagenomes, I could do it. I can print twelve 96-well dilution plates overnight. Working at a leisurely pace, these would allow me to make 1152 metagenome libraries in about two afternoons’ worth of work.

I’m pretty excited about this idea, and there are a lot of different directions one could take it. The College of Engineering here at UC Davis is letting me teach a class this quarter that I’ve decided to call “Robotics for Laboratory Applications,” where we’ll be exploring ways to apply this technology to molecular biology, genomics and ecology. Eight really bright UC Davis undergraduates have signed up (along with the director of the Genome Center’s Bioinformatics Core), and I’m very excited to see what they’ll do!

Environmental metadata collection

To help me sanity check the selection measurement, I decided that I wanted to have detailed measurements of environmental differences among sample sites. Water chemistry, temperature, weather, and variability of these are known to select for or against various species of microbes. The USGS database has extremely detailed measurements of all of these things, all the way down to the isotopic level. However, I still need to take my own measurements to confirm that the site hasn’t changed since it was visited by the USGS team, and to get some idea of what the variability of these parameters might be. It would also be nice if I could retrieve the data remotely, and not have to make return trips to every site.

Unfortunately, commercial environmental data loggers are extraordinarily expensive. The ones that can be left in the field for a few months to log data cost even more. The ones that can transmit the data wirelessly are so expensive that I’d only be able to afford a handful if I blew an entire R01 grant on them.

This bothers me on a moral level. The key components are a few probes, a little lithium polymer battery, a solar panel the size of your hand, and a cell phone. You can buy them separately for maybe fifty bucks, plus the probes. Buying them as an integrated environmental data monitoring solution costs tens of thousands of dollars per unit. A nice one, with weather monitoring, backup batteries and a good enclosure could cost a hundred thousand dollars. You can make whatever apology you like on behalf of the industry, but the fact is that massive overcharging for simple electronics is preventing science from getting done.

So, I ordered a couple of Arduino boards and made my own.

My prototype Arduino-based environmental data logger. This version has a pH probe, Flash storage, and a Bluetooth interface.

The idea is to walk into the field with a data logger and a stick. Then I will find a suitable rock. Then I will pound the stick into the mud with the rock. Then I will strap the data logger to the stick, and leave it there while I go about the business of collecting samples. To keep it safe from the elements, the electronics will be entombed in a protective wad of silicone elastomer with a little solar panel and a battery.
The bill of materials for one of these data loggers is about $200, and so I won’t feel too bad about simply leaving them there to collect data. If the site has cell phone service, I will add a GSM modem to the datalogger (I like the LinkSprite SM5100B with SparkFun’s GSM shield), and transmit the data to my server at UC Davis through an SMS gateway. Then I don’t have to go back to the site to collect the data. This could easily save $200 worth of gasoline. I’ll put pre-paid return shipping labels on them so that they can find their way home someday. I’m eagerly looking forward to decades of calls from Jonathan complaining about my old grimy data loggers showing up in his mail.
From the water, the data logger can measure pH, dissolved oxygen, oxidation/reduction potential, conductivity (from which salinity can be calculated), and temperature. I may also add a small weather station to record air temperature, precipitation, wind speed and direction, and solar radiation. I doubt if all of these parameters will be useful, but the additional instrumentation is not very expensive.

Assembling the genomes

The final technical hurdle is assembling genomes from the metagenomic data. If I have 360 sites and 100 reference genomes, I’m going to have to assemble 36,000 genomes. Happily, I am really re-sequencing them, which is much, much easier than de novo sequencing. Nevertheless, 36,000 is still a lot of genomes.
For each metagenome, I must :
  • Remove adapter contamination with TagDust
  • Trim reads for quality, discard low quality reads
  • Remove PCR duplicates
  • Map reads to references with bwa, bowtie, SHRiMP, or whatever
This yields a BAM file for each metagenome, each representing an alignment of reads to each scaffold of each reference genome. All of the reference genomes can be placed into a single FASTA file with a consistent naming scheme for distinguishing among scaffolds belonging to different organisms. A hundred-odd archaeal reference genomes is about 200-400 megabases, or an order of magnitude smaller than the human genome. Using the Burrows-Wheeler Aligner on a reasonably modern computer, this takes just a few minutes for each metagenome. 
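The per-metagenome mapping step might look something like the following command-building sketch. The bwa mem / samtools syntax is typical current usage, not necessarily the exact invocation used in this pipeline, and the file names are placeholders:

```python
# Sketch of the per-metagenome mapping step as command construction:
# each metagenome's reads are aligned against one combined FASTA of all
# reference genomes, yielding one sorted BAM per sample. Flags shown are
# typical bwa/samtools usage, assumed rather than taken from the text.
import shlex

def mapping_commands(ref_fasta, reads_fastq, out_bam, threads=4):
    """Return the shell pipeline that maps one metagenome to the
    combined reference and sorts the alignment into a BAM file."""
    align = f"bwa mem -t {threads} {shlex.quote(ref_fasta)} {shlex.quote(reads_fastq)}"
    sort = f"samtools sort -o {shlex.quote(out_bam)} -"
    return f"{align} | {sort}"

# placeholder file names for one site's metagenome
cmd = mapping_commands("halophile_refs.fa", "site_042.fastq", "site_042.bam")
print(cmd)
```

Because the combined reference uses one consistent naming scheme, the resulting BAM can be split by scaffold prefix afterward to recover one alignment per organism.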
I’m impatient, though, and so I applied for (and received) an AWS in Education grant. Then I wrote a script that parcels each metagenome off to a virtual machine image, and then unleashes all of them simultaneously on Amazon’s thundering herd of rental computers. Once they finish their alignment, each virtual machine stores the BAM file in my Dropbox account and shuts down. The going rate for an EC2 Extra Large instance is $0.68 per hour.
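As a back-of-the-envelope check on the economics: at the quoted $0.68/hour rate with hourly billing, the whole fleet is cheap. The per-sample runtime below is an assumption for illustration, not a measured figure:

```python
# Back-of-the-envelope cost for the EC2 fan-out described above. The
# $0.68/hour rate is from the text; the per-sample runtime and the
# one-instance-per-metagenome layout are illustrative assumptions.
import math

RATE_PER_HOUR = 0.68     # EC2 Extra Large, as quoted
MINUTES_PER_SAMPLE = 10  # assumed: alignment plus VM spin-up and teardown

def fleet_cost(n_samples, minutes_each=MINUTES_PER_SAMPLE,
               rate=RATE_PER_HOUR, billing_granularity_min=60):
    """Total cost if every metagenome gets its own instance and EC2
    bills by whole hours (as it did at the time)."""
    hours_each = math.ceil(minutes_each / billing_granularity_min)
    return n_samples * hours_each * rate

print(fleet_cost(360))  # cost of mapping all 360 metagenomes at once
```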
This approach could be used for any re-sequencing project, including ChIP-seq, RNA-seq, SNP analysis, and many others.
Aim 3 : Test the dispersal hypothesis using a phylogeographic model with controls for local selection

In order to test my hypothesis, I need to model the dispersal of organisms among the sites. However, in order to do a proper job of this, I need to make sure I’m not conflating dispersal and selective effects in the data used to initialize the model. There are three steps :

  • Identify genomic regions that have recently been under selection
  • Build genome trees with those regions masked out
  • Model dispersal among the sites
In all three cases, there are a large number of methods to choose from. 

One way of detecting the effects of selection is Tajima’s D. This measures deviation from the neutral model by comparing two estimators of the neutral genetic variation, one based on the nucleotide diversity and one based on the number of polymorphic sites. Neutral theory predicts that the two estimators are equal, and so genomic regions in which they are not equal are evolving in a way that is not predicted by the neutral model (i.e., they are under some kind of selection). One can do this calculation on a sliding window to measure Tajima’s D for each coordinate of the genome of each organism. As it turns out, this exact approach was used by David Begun’s lab to study the distribution of selection across the Drosophila genome.
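The statistic itself is easy to compute once you have the two estimators in hand. A sketch for a single window, following Tajima (1989); inputs are the number of sampled sequences, the number of segregating sites in the window, and the average pairwise nucleotide difference:

```python
# Sketch of Tajima's D for one window. Under strict neutrality the
# expectation of (pi - S/a1) is zero, so D should hover near zero.
import math

def tajimas_d(n, S, pi):
    """n: sequences sampled; S: segregating sites; pi: mean pairwise differences."""
    if S == 0:
        return 0.0
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference of the two neutral-variation estimators, normalized by
    # the standard deviation of that difference:
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

Slide this along each genome in windows and you get a per-coordinate profile of D for each organism.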

I will delete the regions of the genomes that deviate significantly (say, by more than one standard deviation) from neutrality. Then I’ll make whole genome alignments, and build a phylogenetic tree for each organism. This tree would contain only characters that (at least insofar as you believe Tajima’s D and Fay and Wu’s H) are evolving neutrally, and are not under selection.
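The masking step might look something like this. A sketch only: the one-standard-deviation cutoff comes from the text above, but the windowing scheme and N-masking convention are my own assumptions:

```python
# Sketch: mask genome windows whose Tajima's D deviates from the mean by
# more than one standard deviation, replacing their sequence with Ns so
# they drop out of the downstream whole-genome alignment.

def mask_selected_windows(seq, window_d, window_size, cutoff_sd=1.0):
    """seq: genome string; window_d: one D value per consecutive window."""
    mean = sum(window_d) / len(window_d)
    var = sum((d - mean) ** 2 for d in window_d) / len(window_d)
    sd = var ** 0.5
    out = list(seq)
    for i, d in enumerate(window_d):
        if abs(d - mean) > cutoff_sd * sd:
            start = i * window_size
            n = min(window_size, len(seq) - start)
            out[start:start + n] = "N" * n
    return "".join(out)
```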

A phylogenetic tree represents evolutionary events that have taken place over time. In order to infer the dispersal of the represented organisms, I would need to model where those events took place. Again, there are a variety of methods for doing this, but my personal favorite is probably the approach used by Isabel Sanmartín for modeling dispersal of invertebrates among the Canary Islands. I don’t know if this is necessarily the best method, but I like the idea that the DNA model and the dispersal model use the same mathematics, and are computed together. Basically, they allowed each taxon to evolve its own DNA model, but constrained by the requirement that they all share a common dispersal model. Then they did Markov Chain Monte Carlo (MCMC) sampling of the posterior distributions of island model parameters (using MrBayes 4.0).
According to Wikipedia, the most respected and widely consulted authority on this and every topic, the General Time Reversible (GTR) model is the most general model describing the rates at which one nucleotide replaces another. If we want to know the rate at which a thymine turns into a guanine, we look at element (2,3) of this matrix:
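For reference, the GTR rate matrix has the form below. This is a reconstruction: I am assuming the nucleotide ordering (A, T, G, C), which is what puts the T→G rate at element (2,3); the diagonal entries are set so that each row sums to zero.

```latex
Q =
\begin{pmatrix}
\cdot        & r_{AT}\pi_T & r_{AG}\pi_G & r_{AC}\pi_C \\
r_{AT}\pi_A  & \cdot       & r_{TG}\pi_G & r_{TC}\pi_C \\
r_{AG}\pi_A  & r_{TG}\pi_T & \cdot       & r_{GC}\pi_C \\
r_{AC}\pi_A  & r_{TC}\pi_T & r_{GC}\pi_G & \cdot
\end{pmatrix}
```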
πG is the stationary state frequency for guanine, and rTG is the exchangeability rate from T to G. However, if we think of this a little differently, as Sanmartín suggests in her paper, we can use the GTR model for the dispersal of species among sites (or islands). If we want to know the rate at which a species migrates from island B to island C, we look in cell (2,3) of a very similar matrix:
Here, πC is the relative carrying capacity of island C, and rBC is the relative dispersal rate from island B to island C. Thus, the total dispersal from island i to island j is
dij = N πi rij πj m

where N is the total number of species in the system, and m is the group-specific dispersal rate. This might look something like this:
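As a stand-in, here is that calculation sketched for a toy three-island system; every number below is invented for illustration:

```python
# Sketch: build the dispersal matrix d_ij = N * pi_i * r_ij * pi_j * m
# for three hypothetical islands. All values are made up.

N = 100   # total number of species in the system
m = 0.1   # group-specific dispersal rate

pi = [0.5, 0.3, 0.2]    # relative carrying capacities (sum to 1)
r = [[0.0, 1.0, 0.4],   # relative dispersal rates r_ij (symmetric, zero diagonal)
     [1.0, 0.0, 0.7],
     [0.4, 0.7, 0.0]]

d = [[N * pi[i] * r[i][j] * pi[j] * m for j in range(3)]
     for i in range(3)]
```

Because r is symmetric and both islands' carrying capacities enter the product, the resulting dispersal matrix is symmetric too, just as in the time-reversible DNA model it borrows from.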
One nifty thing I discovered about MrBayes is that it can link against the BEAGLE library, which can accelerate these calculations using GPU clusters. Suspiciously, Aaron Darling is one of the authors. If you were looking for evidence that the Eisen Lab is a den of Bayesians, this would be it.

This brings us, at last, back to the hypothesis and Baas Becking. Here we have a phylogeographic model of dispersal among sites within a metacommunity, with the effects of selection removed. If the model predicts well-supported finite rates of dispersal within the metacommunity, my hypothesis is sustained. If not, then Baas Becking’s 78-year reign continues.

Epilogue: Lourens Baas Becking, the man versus the strawman

Lourens Baas Becking

Microbiologists have been taking potshots at the Baas Becking hypothesis for a decade or two now, and I am no exception. I’m certainly hoping that the study I’ve outlined here will be the fatal blow.

However, it’s important to recognize that we’ve been a bit unfair to Baas Becking himself. The hypothesis that carries his name is a model, and Baas Becking himself fully understood that dispersal must play an important role in community formation. He understood perfectly well that “alles is overal: maar het milieu selecteert” (“everything is everywhere, but the environment selects”) was not literally true; it is only mostly true, and then only in the context of the observational methodology available at the time. In 1934, in the same book where he proposed his eponymous hypothesis, he observed that there are some habitats that were ideally suited for one microbe or another, and yet these microbes were not present. He offered the following explanation: “There thus are rare and less rare microbes. Perhaps there are very rare microbes, i.e., microbes whose possibility of dispersion is limited for whatever reason.”
Useful models are never “true” in the usual sense of the word. Models like the Baas Becking hypothesis divide the world into distinct intellectual habitats: one in which the model holds, and one in which it doesn’t. At the shore between the two habitats, there is an intellectual littoral zone; a place where the model gives way, and something else rises up. As any naturalist knows, most of the action happens at interfaces; land and sea, sea and air, sea and mud, forest and prairie. The principle applies just as well to the landscape of ideas. The limits of a model, especially one as sweeping as Baas Becking’s, provide a lot of cozy little tidal ponds for graduate students to scuttle around in.
By the way, guess where Lourens Baas Becking first developed his hypothesis? He was here in California, studying the halophiles of the local salt lakes. In fact, the very ones I will be studying.