John Novembre seminar “Ancestry inference and population genomics” #UCDavis

Genetics Seminar

“Ancestry inference and population genomics: Insights to
recombination, migration, and rare variant diversity”

Speaker: John Novembre

University of California, Los Angeles | Department of Ecology and Evolutionary Biology

Monday, June 4, 2012

4:10 PM

1022 Life Sciences

Overselling genomics award #7: Ron Davis & Forbes for PR presented as "essay"

Wow.  Just saw this tweet by Dan Vorhaus:

So I decided to check it out. The piece is titled It’s Time to Bet on Genomics and it is, well, just completely inappropriate.  Sure – it does take on an article that itself was over the top in downplaying the power of genomics (see Erika Check Hayden’s article about that issue here).  But then Davis goes on to write about a company founded by an ex post doc of his for which Davis is one of the advisors (he does kindly let us know this, but still …).  And what he writes is a big big pile of fluff with no evidence presented.  Among the lines in the “essay” I find disturbing:

  • One of the most interesting of these is being developed by Genophen
  • Genophen’s application is rather breathtaking in its ambition.
  • Genophen’s “risk engine”—a simple term for some very complex data mining and computer modeling—will map your risk factors against the world’s vast library of medical research and then offer up a personalized set of behavior and treatment recommendations that can help you reduce those risks . . .  and even prevent disease itself.
  • We are now at the point where genomics-enabled medical technology can run various what-if scenarios and show you whether diet, exercise, medication, or some other factor or combination of factors has the greatest statistical likelihood of reducing that risk. The information can then be visually displayed through charts and graphs and made available to patients and their doctors via secure web-based portals.
  • But instinctively I believe it to be true, and anecdotally Genophen’s first trial provided some confirmation.
  • “The trial changed my life,” one female participant who wishes to remain anonymous told me.

All of this without any link to a paper, without any data, without any real details.  Shameful.  Not saying genomic medicine does not have a lot of promise.  But this “essay” is so excessively focused on PR for one company that there is no reason to have any faith in anything said in it.  I am therefore giving Ron Davis and Forbes my coveted Overselling Genomics Award (#7).  Plus I think Forbes deserves some sort of award for “Publishing PR” but I will have to think one of those up.  This piece almost certainly never should have been published at Forbes.com without many many more caveats.  Yuck.

UPDATE – here is a screenshot from the Forbes Web site.  It is marked as “Forbes Leadership Forum” … hard to tell whether it is meant as an essay, editorial, op-ed, or what.

Preparation Y: Michelle Ellsworth – Performance art mixing dance, genomics, evolution, humor, sex #brilliant

Just got back from a Sloan Foundation funded meeting in Boulder, Colorado that focused on microbiology of the built environment.  More on that another time.  What I want to tell you about – no – what I need to tell you about – is the entertainment that the meeting organizers arranged at dinner Thursday night.

We had dinner at Red Lion Restaurant – a phenomenally gorgeous spot in the canyon just West of Boulder.  And while we were milling around before dinner I saw out of the corner of my eye a woman walking in to the tent where we were to have dinner.  She was dressed in almost all white and was carrying a giant silver spoon.  So I asked the meeting organizer – Mark Hernandez – if she was the entertainment.  And surprisingly he said – yes – she was a dancer and Professor at Boulder and also did a kind of science performance art.

Her name was Michelle Ellsworth and he said she was amazing.  So I was very intrigued as I am a big fan of mixing science and art.  So I went over to where she had set up and asked her a few questions and took her picture …

I grabbed a seat near the front of the area they set up for her.  And Mark Hernandez introduced her

And then I witnessed what I consider to be – seriously – the most entertaining presentation I have ever seen at a conference.  She presented her “Preparation Y” project focused on what should be done to prepare for “the obsolescence of the Y chromosome.”  She then proceeded to discuss some relatively recent work on Y chromosome evolution in humans as well as an article about this by Maureen Dowd.   And she posed the questions (tongue planted firmly in cheek … though done in a style that was remarkably earnest …)

“What will be missed when men are gone? How can we replace them with choreography, apparati and web technology?”

And then she proceeded to use her web site to take us on a tour of some of the issues associated with the demise of the Y chromosome.

Among the concepts she showed us were devices that can do things that men do that one possibly might miss if men were gone including a greenhouse gas emitter, a toilet seat raiser, a smallerizer, a flinger, and more.  To get an idea of what these were go here and then click on the items on the circle (see screen shot below of the seat raiser).

She also had some philosophical discussions of evolution and ecology including pondering the role of men in extinction crises and greenhouse gas emissions and whether men could be considered a “keystone species.”  It was a combination of geeky, absurd, sarcastic, genius, and outrageous.  There was even an extensive discussion of man dances and both recording them and reenacting them (the dances were amazing I note).  The funniest part (from my point of view) was the preservation of male artifacts including smells (which she cans like jam), socks (to which she said she sometimes adds sugar to preserve them better), saliva, and more.  I note – she said she brought vials to collect some additional male artifacts from the meeting for her collection.  I think I was the only one who donated …

Some pictures of the “show” are below.

Anyway – if you ever get a chance to see her show or anyone of her presentations, you must do it.  Also consider browsing her website which has some brilliant videos and other materials.

From her website I discovered she has some videos on Vimeo relating to the preparation Y concept.

Now I have to go check out her “Burger Foundation“.  Not sure what it is … but I am sure it will be good …

5/31#UCDavis “Data Acquisition and Laboratory Tools; Management, Sharing and Ownership”

Please join us for the next session of the 2011-2012 Responsible Conduct of Research Program:

Data Acquisition and Laboratory Tools; Management, Sharing and Ownership

Thursday, May 31st

*1065 Kemper Hall, 12:10-1:10 PM

*Please note the time and room number for the May 31 session, as they differ slightly from previous sessions.

For more information, please visit the UC Davis RCR website: http://www.research.ucdavis.edu/rcr

Illumina Webinar: Analyzing Microbes and Complex Microbial Populations with Next-Generation Sequencing

Analyzing Microbes and Complex Microbial Populations with Next-Generation Sequencing
This webinar will introduce the latest advances in next-generation sequencing for analyzing microbial genomes and transcriptomes, and will present key studies highlighting this technology. Register today.

Date: Thursday, June 7, 2012
Time: 9:00 AM (PT)
Speaker: Abizar Lakdawalla, Ph.D.
Illumina, Inc
Abstract

Sequencing microbial genomes provides a comprehensive understanding that no other method can provide. For example, it is now possible to achieve single base resolution of the bacterial chromosome, and detailed sequence of all extrachromosomal elements, including plasmids and phages. Improvements in next-generation sequencing methods now enable routine whole bacterial genome sequencing in a single day. Assembling the genome from sequencing reads can be easily performed on a desktop computer, allowing high resolution classification of bacterial subtypes. Many important features, including resistance, virulence, and pathogenicity, can be examined simultaneously with high accuracy. In addition to genomic experiments, next-generation sequencing can be used to analyze the complete transcriptome of microbes for interpretation of gene structure and regulation. Sequencing complex microbial populations or metagenomes provides a comprehensive census of species within samples, including those that cannot be cultured or phenotyped. Subtle changes in microbial populations resulting from, or predictive of, changes in the health status of a patient can be assessed easily and accurately with next-generation sequencing.

*As a registrant, you will receive an email with a recording of the presentation after the event should you be unable to attend the live presentation.

Diversity (of speakers, participants) at meetings: do something about it

Some unformed thoughts here but here goes.

Every so often I see a conference announcement and am annoyed by the XY/XX excess for the speakers.  Some recent examples

And more.

Now – I complain about this here and there on Twitter and the like


But I felt that this needed a blog post to not get lost in the Twitter stream.  So here it is.

I note – I have posted about this issue previously: A conference where the speakers are all women? | The Tree of Life and for conferences in which I am involved I have been trying very hard to work on the speaker diversity (not just XX vs XY, but age, career status, ethnicity, etc).  And it certainly can be difficult to make sure that diversity is there.  But the meetings I list above are pretty egregious.  The Genome Canada one features seven major speakers – all white males.  Yes, they are all big names.  But in biology, where women are reasonably well represented, it suggests a bias to me if a meeting can somehow only manage to invite and/or attract all senior, white, XYs to be major speakers.  Not sure what that bias is and it could be different in each case –  could be who is invited – could be the field itself – could be timing/nature of the meeting – could be something to do with families (e.g., perhaps women are invited but are more likely to feel like limiting travel due to roles in child care).

Also I note – biases are not necessarily affecting any one gender or ethnic group.  For example, I have generally stopped going to meetings/conferences that are on weekends and I have also stopped going to meetings/dinners after 6 PM because I do not want to skip out on time with my family.

So here is a plea.  Next time you are involved in organizing a meeting – make some effort to have a strong representation of diversity of speakers and participants.  For example, if you invite lots of women and all say no – try to figure out why and see if you can fix the issue.  Offer travel fellowships for students.  Offer child care or child activity options (even if you cannot pay for it – at least make it easy for people).  Make sure to advertise/promote the meeting to groups/institutions with a high representation of underrepresented groups.  Don’t give up if your first efforts don’t work.  Sometimes it can be difficult to make sure diversity levels are high.  But keep trying … it will help make the conference better and also will help the field in general …

For other posts on this topic see

Dubious Press Release from Cedars-Sinai linking Irritable Bowel Syndrome (IBS) and Bacteria in Gut

Quick one here.

Not impressed with this press release from Cedars-Sinai: Dr. Pimentel links IBS and gut bacteria – Cedars-Sinai (see other variants of it here: Daily Disruption – Cedars-Sinai Study Links Irritable Bowel Syndrome (IBS) and Bacteria in Gut and here: Irritable bowel syndrome clearly linked to gut bacteria).

Among the things that bug me here:

  • They don’t include a link to the paper or even provide a citation
  • They claim that culturing microbes is the “gold standard” for connecting bacteria to the cause of this disease.  AND they imply this is the first method to use culturing to study the disease.  Both notions are wrongheaded.  
  • They confuse cause of IBS and symptoms.  They say that b/c antibiotics help reduce symptoms, therefore, bacteria cause the disease.  Really?  So then fevers must cause things like malaria and flu because ibuprofen helps reduce symptoms right?
  • At some point it might be nice to mention that the MD behind the new study has also been pushing the idea that IBS is caused by bacterial overgrowth for many years both in a book and via a testing company though it is unclear what his association with the company is.  I note – ads for the book claim “In addition, Dr. Pimentel presents a simple treatment protocol that will not only help you resolve your IBS symptoms, but will also prevent their recurrence.”  So – apparently he already had a cure BEFORE the new study was even done.  In general I am skeptical of papers that show evidence for something coming from someone who apparently already “knew” the answer.

Of course, I am not saying IBS is NOT caused by bacterial overgrowth as they claim.  But I can say this – PRs like this make me skeptical that anything new was done in this current publication.

Nice Collection from Diane Dawson: Open Science and Crowd Science: Selected Sites and Resources

Quick post here – already posted to Twitter and wanted to make sure this one was seen by people who read this blog but don’t follow me on Twitter.

There is a nice compilation/commentary/review from Diane Dawson titled Open Science and Crowd Science: Selected Sites and Resources.  It is in the journal “Issues in Science and Technology Librarianship” (which I note – is a new one to me).  It has a lot of useful resources and comments about various open science activities on the web.  Definitely worth checking out.

Yum – Carbon monoxide, worms, bacteria – all together – what could be better

Just a quick one here pointing people to a paper and some stories relating to work by Nicole Dubilier on the worm Olavius algarvensis and its chemosynthetic symbionts.

Guest post: Story Behind the Paper by Joshua Weitz on Neutral Theory of Genome Evolution

I am very pleased to have another in my “Story behind the paper” series of guest posts.  This one is from my friend and colleague Josh Weitz from Georgia Tech regarding a recent paper of his in BMC Genomics.  As I have said before – if you have published an open access paper on a topic related to this blog and want to do a similar type of guest post let me know …

—————————————-
A guest blog by Joshua Weitz, School of Biology and Physics, Georgia Institute of Technology

Summary This is a short, well sort-of-short, story of the making of our paper: “A neutral theory of genome evolution and the frequency distribution of genes” recently published in BMC Genomics. I like the story-behind-the-paper concept because it helps to shed light on what really happens as papers move from ideas to completion. It’s something we talk about in group meetings but it’s nice to contribute an entry in this type of forum.  I am also reminded in writing this blog entry just how long science can take, even when, at least in this case, it was relatively fast. 


The pre-history The story behind this paper began when my former PhD student, Andrey Kislyuk (who is now a Software Engineer at DNAnexus) approached me in October 2009 with a paper by Herve Tettelin and colleagues.  He had read the paper in a class organized by Nicholas Bergman (now at NBACC). The Tettelin paper is a classic, and deservedly so.  It unified discussions of gene variation between genomes of highly similar isolates by estimating the total size of the pan and core genome within multiple sequenced isolates of the pathogen Streptococcus agalactiae.  

However, there was one issue that we felt could be improved: how does one extrapolate the number of genes in a population (the pan genome) and the number of genes that are found in all individuals in the population (the core genome) based on sample data alone?  Species definitions notwithstanding, Andrey felt that estimates depended on details of the alignment process utilized to define when two genes were grouped together.  Hence, he wanted to evaluate the sensitivity of core and pan genome predictions to changes in alignment rules.  However, it became clear that something deeper was at stake.  We teamed up with Bart Haegeman, who was on an extended visit in my group from his INRIA group in Montpellier, to evaluate whether it was even possible to quantitatively predict pan and core genome sizes. We concluded that pan and core genome size estimates were far more problematic than had been acknowledged.  In fact, we concluded that they depended sensitively on estimating the number of rare genes and rare genomes, respectively.  The basic idea can be encapsulated in this figure:

The top panels show gene frequency distributions for two synthetically generated species.  Species A has a substantially smaller pan genome and a substantially larger core genome than does Species B.  However, when one synthetically generates a sample set of dozens, even hundreds of genomes, then the rare genes and genomes that correspond to differences in pan and core genome size, do not end up changing the sample rarefaction curves (seen at the bottom, where the green and blue symbols overlap).  Hence, extrapolation to the community size will not necessarily be able to accurately estimate the size of the pan and core genome, nor even which is larger!
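
To make the sampling problem concrete, here is a minimal Python sketch of how pan- and core-genome rarefaction curves are built from sampled genomes. This is my own illustration, not code from the paper: the function name is invented, and genomes are represented simply as sets of gene-family identifiers (real pipelines derive these families by sequence clustering, which is exactly where the alignment-rule sensitivity comes in).

```python
import random

def pan_core_rarefaction(genomes, n_perm=100, seed=0):
    """Mean pan- and core-genome sizes as genomes are added one at a time.

    `genomes` is a list of sets of gene-family identifiers.  Curves are
    averaged over `n_perm` random orderings of the genomes, the usual way
    rarefaction curves are smoothed.
    """
    rng = random.Random(seed)
    n = len(genomes)
    pan_sums = [0] * n
    core_sums = [0] * n
    for _ in range(n_perm):
        order = rng.sample(genomes, n)
        pan, core = set(), None
        for k, g in enumerate(order):
            pan |= g                                  # union grows the pan genome
            core = set(g) if core is None else core & g  # intersection shrinks the core
            pan_sums[k] += len(pan)
            core_sums[k] += len(core)
    return ([s / n_perm for s in pan_sums],
            [s / n_perm for s in core_sums])
```

The endpoints of the two curves are fixed by the sample itself, but as the figure illustrates, very different underlying populations can produce nearly identical curves, which is why extrapolating beyond the sample is so fragile.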

As an alternative, we proposed a metric we termed “genomic fluidity” which captures the dissimilarity of genomes when comparing their gene composition.

The quantitative value of genomic fluidity of the population can be estimated robustly from the sample.  Moreover, even if the quantitative value depends on gene alignment parameters, its relative order is robust.  All of this work is described in our paper in BMC Genomics from 2011: Genomic fluidity: an integrative view of gene diversity within microbial populations.
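
For readers who want the definition operationally: genomic fluidity is the average, over all pairs of genomes, of the fraction of genes unique to either member of the pair. A small sketch (again my own illustrative code, with genomes represented as sets of gene families):

```python
from itertools import combinations

def genomic_fluidity(genomes):
    """Mean, over genome pairs, of (genes unique to either) / (total genes).

    Returns a value in [0, 1]: 0 when all pairs carry identical gene sets,
    approaching 1 when pairs share almost nothing.
    """
    ratios = [len(a ^ b) / (len(a) + len(b))   # symmetric difference over combined size
              for a, b in combinations(genomes, 2)]
    return sum(ratios) / len(ratios)
```

Two identical genomes give a fluidity of 0; two genomes sharing no genes give 1; two genomes of equal size sharing half their genes give 0.5.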

However, as we were midway through our genomic fluidity paper, it occurred to us that there was one key element of this story that merited further investigation.  We had termed our metric “genomic fluidity” because it provided information on the degree to which genomes were “fluid“, i.e., comprised of different sets of genes.  The notion of fluidity also implies a dynamic, i.e., a mechanism by which genes move. Hence, I came up with a very minimal proposal for a model that could explain differences in genomic fluidity.  As it turns out, it can explain a lot more.

A null model: getting the basic concepts together
In Spring 2010, I began to explore a minimal, population-genetics style model which incorporated a key feature of genomic assays, that the gene composition of genomes differs substantially, even between taxonomically similar isolates. Hence, I thought it would be worthwhile to  analyze a model in which the total number of individuals in the population was fixed at N, and each individual had exactly M genes.  Bart and I started analyzing this together. My initial proposal was a very simple model that included three components: reproduction, mutation and gene transfer. In a reproduction step, a random individual would be selected, removed and then replaced with one of the remaining N-1 individuals.  Hence, this is exactly analogous to a Moran step in a standard neutral model.  At the time, what we termed mutation was actually representative of an uptake event, in which a random genome was selected, one of its genes was removed, and then replaced with a new gene, not found in any other of the genomes.  Finally, we considered a gene transfer step in which two genomes would be selected at random, and one gene from a given genome would be copied over to the second genome, removing one of the previous genes.  The model, with only birth-death (on left) and mutation (on right), which is what we eventually focused on for this first paper, can be depicted as follows:

We proceeded based on our physics and theoretical ecology backgrounds, by writing down master equations for genomic fluidity as a function of all three events. It is apparent that reproduction decreases genomic fluidity on average, because after a reproduction event, two genomes have exactly the same set of genes.  Likewise, gene transfer (in the original formulation) also decreases genomic fluidity on average, but the decrease is smaller by a factor of 1/M, because only one gene is transferred.  Finally, mutation increases genomic fluidity on average, because a mutation event occurring at a gene which had before occurred in more than one genome, introduces a new singleton gene in the population, hence increasing dissimilarity. The model was simple, based on physical principles, was analytically tractable, at least for average quantities like genomic fluidity, and moreover it had the right tension.  It considered a mechanism for fluidity to increase and two mechanisms for fluidity to decrease.  Hence, we thought this might provide a basis for thinking about how relative rates of birth-death, transfer and uptake might be identified from fluidity.  As it turns out, many combinations of such parameters lead to the same value of fluidity.  This is common in models, and is often referred to as an identifiability problem. However, the model could predict other things, which made it much more interesting.   
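
A minimal simulation of the birth-death-plus-mutation version of the model (the version the paper eventually focused on) might look like the sketch below. The function name, the per-step mutation probability, and all parameter values are illustrative assumptions of mine, not taken from the paper:

```python
import random

def simulate_neutral_pangenome(N=30, M=50, mu=0.1, steps=20000, seed=1):
    """Sketch of a neutral birth-death + mutation model of gene content.

    N genomes each carry exactly M genes.  Every step performs a Moran-style
    reproduction event (one genome replaced by a copy of another); with
    probability `mu` a mutation/uptake event also occurs, replacing one gene
    in a random genome with a gene never seen before in the population.
    """
    rng = random.Random(seed)
    genomes = [list(range(M)) for _ in range(N)]
    next_gene = M  # counter supplying globally novel gene identifiers
    for _ in range(steps):
        # reproduction: genome i is removed and replaced by a copy of j != i
        i, j = rng.sample(range(N), 2)
        genomes[i] = list(genomes[j])
        # mutation/uptake: swap one gene for a brand-new one
        if rng.random() < mu:
            g = rng.randrange(N)
            genomes[g][rng.randrange(M)] = next_gene
            next_gene += 1
    return [set(g) for g in genomes]
```

Because every mutation introduces a globally novel gene, each genome always holds exactly M distinct genes, matching the fixed-genome-size assumption of the model.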


The making of the paper
The key moment when the basic model, described above, began to take shape as a paper occurred when we began to think about all the data that we were not including in our initial genomic fluidity analysis.  Most prominently, we were not considering the frequency at which genes occurred amongst different genomes.  In fact, gene frequency distributions had already attracted attention.  A gene frequency distribution summarizes the number of genes that appear in exactly k genomes. The frequency with which a gene appears is generally thought to imply something about its function, e.g., “Comprising the pan-genome are the core complement of genes common to all members of a species and a dispensable or accessory genome that is present in at least one but not all members of a species.” (Laing et al., BMC Bioinformatics 2011).  The emphasis is mine. But does one need to invoke selection, either implicitly or explicitly, to explain differences in gene frequency? 

As it turns out, gene frequency distributions end up having a U-shape, such that many genes appear in 1 or a few genomes, many in all genomes (or nearly all), and relatively few occur at intermediate levels.  We had extracted such gene frequency distributions from our limited dataset of ~100 genomes over 6 species.  Here is what they look like:

And, when we began to think more about our model, we realized that the tension that led to different values of genomic fluidity also generated the right sort of tension corresponding to U-shaped gene frequency distributions.  On the one hand, mutations (e.g., uptake of new genes from the environment) would contribute to shifting the distribution to the left-hand side of the U-shape.  On the other hand, birth-death would contribute to shifting the distribution to the right-hand side of the U-shape.  Gene transfer between genomes would also shift the distribution to the right. Hence, it seemed that for a given set of rates, it might be possible to generate reasonable fits to empirical data that would generate a U-shape. In doing so, that would mean that the U-shape was not nearly as informative as had been thought.  In fact, the U-shape could be anticipated from a neutral model in which one need not invoke selection. This is an important point as it came back to haunt us in our first round of review.

So, let me be clear: I do think that genes matter to the fitness of an organism and that if you delete/replace certain genes you will find this can have mild to severe to lethal costs (and occasional benefits).  However, our point in developing this model was to try and create a baseline null model, in the spirit of neutral theories of population genetics, that would be able to reproduce as much of the data with as few parameters as possible.  Doing so would then help identify what features of gene compositional variation could be used as a means to identify the signatures of adaptation and selection.  Perhaps this point does not even need to be stated, but obviously not everyone sees it the same way.  In fact, Eugene Koonin has made a similar argument in his nice paper, Are there laws of adaptive evolution: “the null hypothesis is that any observed pattern is first assumed to be the result of non-selective, stochastic processes, and only once this assumption is falsified, should one start to explore adaptive scenarios”.  I really like this quote, even if I don’t always follow this rule (perhaps I should). It’s just so tempting to explore adaptive scenarios first, but it doesn’t make it right.

At that point, we began extending the model in a few directions.  The major innovation was to formally map our model onto the infinitely many alleles model of population genetics, so that we could formally solve our model using the methods of coalescent theory for both cases of finite population sizes and for exponentially growing population sizes.  Bart led the charge on the analytics and here’s an example of the fits from the exponentially growing model (the x-axis is the number of genomes):

At that point, we had a model, solutions, fits to data, and a message.  We solicited a number of pre-reviews from colleagues who helped us improve the presentation (thank you for your help!).  So, we tried to publish it.    


Trying to publish the paper
We tried to publish this paper in two outlets before finding its home in BMC Genomics.  First, we submitted the article to PNAS using their new PNAS Plus format.  We submitted the paper in June 2011 and were rejected with an invitation to resubmit in July 2011. One reviewer liked the paper, apparently a lot: “I very much like the assumption of neutrality, and I think this provocative idea deserves publication.”  The same reviewer gave a number of useful and critical suggestions for improving the manuscript.  Another reviewer had a very strong negative reaction to the paper. Here was the central concern: “I feel that the authors’ conclusion that the processes shaping gene content in bacteria and primarily neutral are significantly false, and potentially confusing to readers who do not appreciate the lack of a good fit between predictions and data, and who do not realise that the U-shaped distributions observed would be expected  under models where it is selection that determines gene number.”  There was no disagreement over the method or the analysis.  The disagreement was one of what our message was.

I still am not sure how this confusion arose, because throughout our first submission and our final published version, we were clear that the point of the manuscript was to show that the U-shape of gene frequency distributions provide less information than might have been thought/expected about selection.  They are relatively easy to fit with a suite of null models.  Again, Koonin’s quote is very apt here, but at some basic level, we had an impasse over a philosophy of the type of science we were doing. Moreover, although it is clear that non-neutral processes are important, I would argue that it is also incorrect to presume that all genes are non-neutral.  There’s lots of evidence that many transferred genes have little to no effect on fitness. We revised the paper, including and solving alternative models with fixed and flexible core genomes, again showing that U-shapes are rather generic in this class of models.  We argued our point, but the editor sided with the negative review, rejecting our paper in November after resubmission in September, with the same split amongst the reviewers. 

Hence, we resubmitted the paper to Genome Biology, which rejected it at the editorial level after a delay of a few weeks without much of an explanation, and at that point, we decided to return to BMC Genomics, which we felt had been a good home for our first paper in this area and would likely make a good home for the follow-up.  A colleague once said that there should be an r-index, where r is the number of rejections a paper received before ultimate acceptance.  He argued that r-indices of 0 were likely not good (something about if you don’t fall, then you’re not trying) and an r-index of 10 was probably not good either.  I wonder what’s right or wrong. But I’ll take an r of 2 in this case, especially because I felt that the PNAS review process really helped to make the paper better even if it was ultimately rejected. And, by submitting to Genome Biology, we were able to move quickly to another journal in the same BMC consortium.

Upcoming plans
Bart Haegeman and I continue to work on this problem, from both the theory and bioinformatics side.  I find this problem incredibly fulfilling.  It turns out that there are many features of the model that we still have not fully investigated.  In addition, calculating gene frequency distributions involves a number of algorithmic challenges to scale-up to large datasets.  We are building a platform to help, to some extent, but are looking for collaborators who have specific algorithmic interests in these types of problems.  We are also in discussions with biologists who want to utilize these types of analysis to solve particular problems, e.g., how can the analysis of gene frequency distributions be made more informative with respect to understanding the role of genes in evolution and the importance of genes to fitness.  I realize there are more of such models out there tackling other problems in quantitative population genomics (we cite many of them in our BMC Genomics paper), including some in the same area of understanding the core/pan genome and gene frequency distributions. I look forward to learning from and contributing to these studies.