Jonathan Eisen's Lab

Lab meeting, May 23rd, 2012

Ladan Doroud will be presenting for this week’s lab meeting.
We will meet in the Genome Center in room 5206 from 1:30 to 3pm.

Story behind the paper guest post on "Resolving the ortholog conjecture"

This is another in my ongoing “Story behind the paper” series. This one is from Christophe Dessimoz on a new paper he is an author on in PLoS Computational Biology that is near and dear to my heart.

See below for more. I am trying to post this from Yosemite National Park without full computer access so I hope the images come through. If not I will fix in a few days.

……………..

I’d like to thank Jonathan for the opportunity to tell the story behind our paper, which was just published in PLoS Computational Biology. In this work, we corroborated the “ortholog conjecture”—the widespread but little tested notion that orthologs tend to be functionally more conserved than paralogs.

I’d also like to explore more general issues, including the pitfalls of statistical analyses on highly heterogeneous data such as the Gene Ontology, and the pivotal role of peer-reviewers.

Like many others in computational biology, this project started as a quick analysis that was meant to take “just a few hours” but ended up keeping us busy for several years…

The ortholog conjecture and alternative hypotheses

The ortholog conjecture states that on average and for similar levels of sequence divergence, genes that started diverging through speciation (“orthologs”) are more similar in function than genes that started diverging through duplication (“paralogs”). This is based on the idea that gene duplication is a driving force behind function innovation. Intuitively, this makes sense because the extra copy arising through duplication should provide the freedom to evolve new function. This is the conventional dogma.
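To make the definition concrete, here is a minimal sketch (not code from the paper) of how a gene pair can be labelled from a rooted gene tree: the pair is orthologous if its last common ancestor is a speciation node and paralogous if it is a duplication node. The `Node` class, the `classify_pair` function and the toy tree are all hypothetical; the tree encodes the four septin genes discussed later in this post, assuming a single duplication that predates the split between the two yeasts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a toy rooted gene tree (hypothetical structure)."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" on internal nodes
    children: List["Node"] = field(default_factory=list)


def path_to(node, target, path=None):
    """Return the list of nodes from `node` down to the leaf named `target`."""
    path = (path or []) + [node]
    if node.name == target:
        return path
    for child in node.children:
        found = path_to(child, target, path)
        if found:
            return found
    return None


def classify_pair(root, gene_a, gene_b):
    """Label a gene pair by the event at its last common ancestor."""
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise ValueError("both genes must be leaves of the tree")
    lca = None
    for x, y in zip(path_a, path_b):
        if x is y:
            lca = x  # still on the shared part of the two root-to-leaf paths
        else:
            break
    return "orthologs" if lca.event == "speciation" else "paralogs"


# Toy tree (assumed topology): one duplication, then a speciation in each copy.
tree = Node("root", "duplication", [
    Node("copy1", "speciation", [Node("Cdc10"), Node("Spn2")]),
    Node("copy2", "speciation", [Node("Cdc12"), Node("Spn4")]),
])

print(classify_pair(tree, "Cdc10", "Spn2"))   # pair split by speciation
print(classify_pair(tree, "Cdc10", "Cdc12"))  # pair split by duplication
```

Real orthology pipelines infer these event labels from sequence data, which is itself error-prone; the sketch only shows the definition once the labels are given.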

Alternatively, for similar levels of sequence divergence, there might not be any particular difference between orthologs and paralogs. It is the simplest explanation (per Ockham’s razor), and it also makes sense if the function of a gene is mainly determined by its protein sequence (let’s just consider one product per gene). Following this hypothesis, we might expect considerable correlation between sequence and function similarity.

But these are by no means the only two possible hypotheses. Notably, Nehrt and colleagues saw higher function conservation among within-species homologs than between-species homologs, which prompted them to conclude: “the most important aspect of functional similarity is not sequence similarity, but rather contextual similarity”. If the environment (“the context”) is indeed the primary evolutionary driving force, it is not unreasonable to speculate that within-species paralogs could evolve in a correlated manner, and thus be functionally more similar than their between-species counterparts.

Why bother testing these hypotheses?

Testing these hypotheses is important not only for better understanding gene function evolution in general, but it also has practical implications. The vast majority of gene function annotations (98% of Gene Ontology annotations) are propagated computationally from experimental data on a handful of model organisms, often using models based on these hypotheses.

How our work started

Our project was born during a break at the 10th anniversary meeting of the Swiss Institute of Bioinformatics in September 2008. I was telling Marc Robinson-Rechavi (University of Lausanne) about my work with Adrian Altenhoff on orthology benchmarking (as it happens, another paper edited by Jonathan!), which had used function similarity as a surrogate measure for orthology. We had implicitly assumed that the ortholog conjecture was true, a fact that Marc zeroed in on. He was quite sceptical of the ortholog conjecture, and around this time, together with his graduate student Romain Studer, he published an opinion in Trends in Genetics unambiguously entitled “How confident can we be that orthologs are similar, but paralogs differ?” (self-archived preprint). So, having all that data on hand, we flipped our analysis on its head and set out to compare the average Gene Ontology (GO) annotation similarity of orthologs vs. paralogs. Little did we think that this analysis would keep us busy for over 3 years!

First attempt

It only took a few weeks to obtain our first results. But we were very puzzled. As Nehrt et al. would later publish, we observed that within-species paralogs tended to be functionally more conserved than orthologs. At first we were very sceptical. After all, Marc had been leaning toward the uniform ortholog/paralog hypothesis, and I had expected the ortholog conjecture to hold. We started controlling for all sorts of potential sources of biases and structure in the data (e.g. source of ortholog/paralog predictions, function and sequence similarity measures, variation among species clades). A year into the project, our supplementary materials had grown to a 67-page PDF chock-full of plots! The initial observation held under all conditions. By then, we were starting to feel that our results were not artefactual and that it was time to communicate our results. (We were also running out of ideas for additional controls and were hoping that peer-reviewers might help!)

Rejections

We tried to publish the paper in a top-tier journal, but our manuscript was rejected prior to peer-review. I found it frustrating that although the work was deemed important, they rejected it prior to review over an alleged technical deficiency. In my view, technical assessments should be deferred to the peer-review process, when referees have the time to scrutinise the details of a manuscript.

Genome Research sent our manuscript out for peer-review, and we received one critical, but insightful report. The referee contended that our results were due to species-specific factors, which arise because “paralogs in the same species tend to be ‘handled’ together, by experimenters and annotators”. The argument built on one example which we had discussed in the paper: S. cerevisiae Cdc10/Cdc12 and S. pombe Spn2/Spn4 are paralogs inside each species, while Cdc10/Spn2 and Cdc12/Spn4 are the respective pairs of orthologs. The Gene Ontology annotations for the orthologs were very different, while the annotations of the paralogs were very similar. The reviewer looked at the source articles in detail, and noticed that “the functional divergence between these genes is more apparent than real”. Both pairs of paralogs were components of the septin ring. The differences in annotation appeared to be due to differences in the experiments done and in the way they were transcribed. The reviewer stated:

“A single paper will often examine the phenotype of several paralogs within one species, resulting in one paper, which is presumably processed by one GO annotator at one time. In contrast, phenotypes of orthologs in different species almost always come from different papers, via different annotation teams.”

Authorship effect: an easily overlooked bias

At first, it was tempting to just dismiss the criticism. After all, as Roger Brinner put it, “the plural of anecdote is not data.” More importantly, we had tried to address several potential species-specific biases, such as uneven annotation frequencies among species (e.g. due to developmental genes being predominantly studied in C. elegans). And we had been cautious in our conclusions, suggesting that our results might be due to an as yet unknown confounder in the Gene Ontology dataset (remember that we had run out of ideas?). So the referee was not telling us anything we did not know.

Or was (s)he? Stimulated by the metaphor of same-species paralogs being “handled” together, we decided to investigate whether common authorship might correlate in any way with function annotation similarity. Here’s what we observed:

The similarity of function annotations from a common paper is much higher than otherwise! Even if we restrict ourselves to annotations from different papers, but with at least one author in common, the similarity of functional annotations is still considerably higher than with papers without a common author.
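The grouping described above can be sketched as follows. This is a toy illustration, not the analysis from the paper: the record layout, the gene, term, paper and author names are all made up, and the Jaccard index is only a stand-in for the function similarity measures actually used.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical annotation records; every value here is invented.
annotations = [
    {"gene": "geneA", "terms": {"t1", "t2"}, "paper": "paper1", "authors": {"Smith", "Lee"}},
    {"gene": "geneB", "terms": {"t1", "t2"}, "paper": "paper1", "authors": {"Smith", "Lee"}},
    {"gene": "geneC", "terms": {"t1", "t3"}, "paper": "paper2", "authors": {"Lee"}},
    {"gene": "geneD", "terms": {"t4"}, "paper": "paper3", "authors": {"Garcia"}},
]


def jaccard(a, b):
    """Jaccard index of two term sets (a simple stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def authorship_category(x, y):
    """Assign a pair of annotation records to one of three authorship classes."""
    if x["paper"] == y["paper"]:
        return "same paper"
    if x["authors"] & y["authors"]:
        return "different papers, common author"
    return "no common author"


def similarity_by_authorship(records):
    """Mean annotation similarity of gene pairs within each authorship class."""
    groups = defaultdict(list)
    for x, y in combinations(records, 2):
        if x["gene"] == y["gene"]:
            continue  # only compare annotations of different genes
        groups[authorship_category(x, y)].append(jaccard(x["terms"], y["terms"]))
    return {category: mean(values) for category, values in groups.items()}


for category, value in similarity_by_authorship(annotations).items():
    print(f"{category}: {value:.2f}")
```

On real data each class would contain many pairs, and one would want to look at the spread within each class as well as the mean.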

Simpson’s Paradox

In itself, the authorship effect is not necessarily a problem: if annotations between orthologs and paralogs were similarly distributed among the different types, differences due to authorship effects would average out. The problem here is that paralogs are one order of magnitude more likely to be annotated by the same lab than orthologs. This gives rise to “Simpson’s paradox”: paralogs can appear to be functionally more similar than orthologs just because paralogs are much more likely to be studied by the same people.

A classical example of Simpson’s paradox is the “Berkeley gender bias case” (Wikipedia article): the university was sued for bias against women applicants based on the aggregate 1973 admission figures (44% men admitted vs. 35% women). As it turned out, the admission rate for each department was in fact similar for both sexes (and even in favour of women in a few departments). The lower overall acceptance rate for women was not due to gender bias, but to the tendency of women to apply to more competitive departments.
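The same reversal is easy to reproduce with made-up numbers. In the sketch below, all counts and mean similarities are invented for illustration (they are not taken from the paper): within each authorship stratum orthologs score higher than paralogs, yet the pooled means point the other way because most paralog pairs fall in the high-similarity "same lab" stratum.

```python
# stratum -> pair type -> (number of pairs, mean similarity); all values invented
toy = {
    "same lab": {"orthologs": (100, 0.80), "paralogs": (900, 0.75)},
    "different labs": {"orthologs": (900, 0.50), "paralogs": (100, 0.45)},
}


def pooled_mean(data, pair_type):
    """Mean similarity over all strata, weighted by the number of pairs."""
    total_pairs = sum(stratum[pair_type][0] for stratum in data.values())
    weighted_sum = sum(n * m for n, m in (stratum[pair_type] for stratum in data.values()))
    return weighted_sum / total_pairs


def orthologs_higher_within_strata(data):
    """True if orthologs beat paralogs in every single stratum."""
    return all(s["orthologs"][1] > s["paralogs"][1] for s in data.values())


print("orthologs higher in every stratum:", orthologs_higher_within_strata(toy))
print(f"pooled orthologs: {pooled_mean(toy, 'orthologs'):.2f}")
print(f"pooled paralogs:  {pooled_mean(toy, 'paralogs'):.2f}")
```

With these toy values the pooled means come out at about 0.53 for orthologs and 0.72 for paralogs, even though orthologs are ahead in both strata. Comparing within strata (or reweighting the strata equally) removes the reversal.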

Paper by Nehrt et al.

Finding the authorship effect meant that we had to reanalyse all our data, and completely rewrite our manuscript. A few months into this process, in June 2011, Matt Hahn and colleagues published their paper (Nehrt et al., Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals, PLoS Comput Biol 2011). Matt has written a very interesting story behind the paper on this blog, which is well-worth reading (including comments).

While we weren’t very surprised by the essence of their observations—they were very consistent with our initial (rejected) manuscript—we were nevertheless struck by the similarity in the presentation of the results:

On the left, plot 2A from Nehrt et al., PLoS Comput Biol 2011; On the right, plot from our initial, rejected manuscript. Note that their blue (within-spec outparalogs) and green (inparalogs) lines are combined in our plot (same-species paralogs, yellow line)

The publication of Nehrt et al.’s paper gave us mixed feelings. Obviously, their work was taking away some of the novelty in our study. But at the same time, they were drawing considerable attention to the problem (not least by coining a catchy name to describe the question!). And of course, we already knew at this point that their observations were confounded by factors such as the authorship effect, so it was not the end of the story.

Is it possible to draw reliable conclusions from observational data such as GO annotations?

Before I move to our findings, it’s worth pondering a bit more on the issue of biases in the data. Statisticians and epidemiologists make a strong distinction between experimental data (=data from a controlled experiment, designed such that the case and control groups are as identical as possible in all respects other than a factor of interest) and observational data (=data lying around). (On a side note, Ewan Birney recently wrote a great post on study design in genetic and phenotypic studies, with several ideas relevant to this issue.) Data pulled from the GO database clearly falls into the second category: observational data. We are at the mercy of potentially countless hidden effects biasing our conclusions in all sorts of ways.

Can we rely on this data at all? For some, the answer appears to be a categorical “no”. A more pragmatic stance was expressed by the GO consortium in a recent reply to Nehrt et al.’s paper, which identified potential confounding effects ignored in the study of Nehrt et al., such as species-specific annotation biases (they suggested, tongue-in-cheek, that the study instead supports the “biased annotation conjecture”), and stressed that “users of GO should ensure that they test for, and adjust for, potential biases prior to interpretation”.

In the end, I think that this debate, consistent with our experience, highlights the risks of working with observational data. But at the same time, observational data is often all we have (or can afford), so the best course of action seems indeed to control for known confounders, try to identify unknown ones, and cautiously proceed.

Resolving the conjecture

Controlling for the authorship effect and other biases (some previously known, others newly identified), we found that for similar levels of divergence, orthologs tend to be functionally more conserved than paralogs. This is true for different sources of orthology/paralogy predictions, different types of function and sequence similarity measures, and different data sampling strategies. But in absolute terms, the difference is often not very strong, and varies considerably among species and types of functions. Our study confirms the ortholog conjecture, but at the same time it shows that the conjecture is not very useful in practice as it does not have much predictive power.

There were two crucial contributors to this study: the peer-reviewers, and open science. We are obviously indebted to the reviewer who rejected our paper on the basis of a potential authorship bias. The reviewers of the resubmission provided detailed and highly competent feedback. As for open science, where would computational biology be without it? We often take it for granted, but without publicly available genome data and functional annotations, a study like this would never have happened. Chemistry suffers from having only very few publicly available databases (e.g. ChEMBL). People hard at work behind model organism databases have our deepest appreciation.

Upcoming discussions

As far as I can tell, Nehrt et al.’s study caused a considerable stir in our community, but many critics could not quite put their finger precisely on what was wrong (see Matt Hahn’s post, especially section “The fallout, and some responses”). Our work explains and reconciles their controversial observations.

This is by no means the end of the discussion. There will be a symposium dedicated to the ortholog conjecture at SMBE in Dublin next month (let me know if you want to meet up), and it will almost certainly be a topic of the next Quest for Orthologs meeting (tentatively scheduled for the summer 2013). Meanwhile, I hope this work will spur discussions on this blog (or, in French, on Marc Robinson-Rechavi’s blog!) and/or on Twitter (you can follow me at @cdessimoz, and Marc at @marc_rr).

Thanks to Marc Robinson-Rechavi and Mary Todd Bergman for their feedback on drafts of this post.

Interesting report from White House: National Bioeconomy Blueprint

Been reading the “National Bioeconomy Blueprint” from the White House. It is definitely worth checking out (for some background information see this blog post from the White House: National Bioeconomy Blueprint Released | The White House and this NY Times article White House Promotes a Bioeconomy – NYTimes.com from last month). Also check out the main page describing this document: National Bioeconomy Blueprint | The White House.

The blueprint outlines five main objectives:

Support R&D investments that will provide the foundation for the future U.S. bioeconomy.
Facilitate the transition of bioinventions from research lab to market, including an increased focus on translational and regulatory sciences.
Develop and reform regulations to reduce barriers, increase the speed and predictability of regulatory processes, and reduce costs while protecting human and environmental health.
Update training programs and align academic institution incentives with student training for national workforce needs.
Identify and support opportunities for the development of public-private partnerships and precompetitive collaborations—where competitors pool resources, knowledge, and expertise to learn from successes and failures.

It then goes through some background and recommendations to help achieve these objectives.

Other discussion of this includes:

Something fishy with this story: bacteria in fish pedicures

Well, the title drew me in, without a doubt: Fish Pedicures: Bacteria in Your Foot Soak.

To start with, I guess I have been out of touch as I have never heard of fish pedicures before. Sounds lovely I must say.

Though if you are considering doing this you might be dissuaded by some of the revelations in the article including that “fish are living creatures that deposit their waste products in the very water in which people are soaking” and “the impossibility of disinfecting or sanitizing live fish.”

Amazingly, fish pedicures are in fact apparently quite popular. So popular that there are multiple investigations relating to this practice including that “British authorities investigated a reported bacterial outbreak among 6,000 Garra rufa fish” and “Last spring, British fish inspectors went to London’s Heathrow Airport and intercepted Indonesian shipments of the silver, inch-long freshwater carp destined for British ‘fish spas.’”

And now – the reason for this article – there is a new report in the journal Emerging Infectious Diseases on “Zoonotic Disease Pathogens in Fish Used for Pedicure.” The article is actually somewhat fascinating and thanks to the CDC it is freely available.

Fun reading for the day …

Electronic Lab Notebook tech demo at #UCDavis 5/18

Just got this email and am sharing

Electronic Laboratory Notebooks–an information breakthrough for your lab?

Friday, May 18

3 pm

5206 GBSF

** AND **

12-2 pm

Day On the Dock

Behind Haring Hall

Dear UC Davis Researchers,

Is your research group tablet- or iPad- compatible? Would you like to find out?

Please come to a demonstration of electronic laboratory notebooks (ELN) in GBSF 5206 on Friday, May 18 at 3 pm. Rory MacNeil, from Axiope, (http://www.axiope.com) will be visiting from Scotland to discuss ELN in general, and the software he helped design, eCAT. This application is very suitable for academic research labs and collaboration among a group.

If you cannot attend this demonstration, Rory will be at Day On the Dock, station #1, behind Haring Hall between 12-2 pm. Please stop by.

1. You may watch an extended (80 min) overview of ELN from LabManager webcast,
http://www.labmanager.com/?articles.view/articleNo/4575/article/Webinar–Next-Generation-Electronic-Laboratory-Notebooks–ELNs- (This presentation is primarily targeted to commercial labs; the latest generation of ELNs are compatible with basic research groups. Three additional vendors discuss their products: Agilent, Accelrys, Waters.)

2. You may watch the following brief video introducing inventory management in eCAT
http://www.axiope.com/electronic-lab-notebook/blog/product/?p=201

3. You may also sign up for a free personal account or a free trial of the group versions at
http://www.axiope.com/electronic-lab-notebook/free_trial.php

2012 #UCDavis Faculty Research Lecture Award-Michael Turelli 6/6 4PM “How good luck, great collaborators, pretty mathematics and a maternally inherited bacterium (Wolbachia) may stop the spread of dengue fever”

Michael Turelli

Distinguished Professor of Genetics

Department of Evolution and Ecology

and The Center for Population Biology

Recipient of the

2012 UC DAVIS

ACADEMIC SENATE

FACULTY RESEARCH LECTURE AWARD

“How good luck, great collaborators, pretty mathematics and a maternally inherited bacterium (Wolbachia) may stop the spread of dengue fever”

Public Lecture

June 6, 2012

4:10 p.m.

1322 Storer Hall


Kocuria rosea, I choose you!

In a previous post, I noted that I was currently working on identifying some environmental isolates I took from various locations. When I got my 16S sequences back, I had the ever so popular Micrococcus, Staphylococcus, and Bacillus bacteria in most of my samples. Two that stood out were Kocuria kristinae (OTW) and Kocuria rosea (OTCP). Of the Kocuria species, only one has a completed and published genome, Kocuria rhizophila. One other is in permanent draft, and another three are targeted. By running BLAST on the 16S genes of my two samples, I determined that both are 96% identical at the 16S gene level to the published species. Given this level of divergence, both samples are fairly distantly related to Kocuria rhizophila.

Now came the hard question of which one to use for library construction. The two isolates are closely related to each other and found in the same general type of environment. I checked my genomic DNA concentration of the two samples and it turned out that I did not have much genomic DNA of Kocuria kristinae and had plenty of genomic DNA (about 785 ng/µL) of Kocuria rosea. Thus, it was more practical for me to use Kocuria rosea for my library construction project.

To give you a little information on Kocuria rosea, it is a type of soil bacterium that has been found in various locations such as polluted soil, indoor environments, deep sea sediments, and a spacecraft. But mostly, it is isolated from soil and water. A paper online has also linked it to catheter-related infection.

Candidate for Sequencing- THP

It is very exciting that I have finally found an organism worth sequencing. After submitting 9 different samples to be sequenced I have obtained a potential candidate to be fully sequenced and that organism is named THP. THP stands for toilet handle pink colony. The original sample for THP was taken from the toilet handle in my apartment bathroom. After running the sequence through BLAST I discovered that the sequence belongs to an unknown species in the genus Dietzia.

After finding this information, I then looked for how many completed, incomplete and targeted projects there were in GOLD. Here I discovered that there is one completed project for Dietzia alimentaria 72 and an incomplete project for a different species, Dietzia cinnamea. Although two species in the same genus have already been sequenced or will be sequenced, I think that THP will be a good candidate to sequence because it can potentially be a new species. If THP is a new species we can use the sequence to compare and contrast with both alimentaria and cinnamea. Along with doing this, this unknown species can potentially tell us more about organisms that live on things we use on a daily basis.

The next thing David and I did was track down the sequences of both Dietzia alimentaria and cinnamea and match those sequences against the THP sequence. After further analyzing the sequences of both Dietzia alimentaria and cinnamea we discovered that alimentaria 72 has 98.1% identity to ours (67% GC) and cinnamea has 97.6% identity to ours (70.9% GC). These two species are about 97% identical to each other. Dietzia alimentaria is from traditional fermented Korean food and Dietzia cinnamea is found in petroleum contaminated soil in Brazil. Some interesting information about Dietzia cinnamea is that it is able to degrade petroleum hydrocarbons. I found it to be really interesting that these two species are in the same genus because they are found in very different environments.
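For readers who have not computed these numbers by hand, here is a rough sketch of what "percent identity" and "% GC" mean. It is not the BLAST calculation itself: the function names are made up, the sequences must already be aligned (gaps written as "-"), and identity is counted only over columns without gaps, which is one of several conventions and can give slightly different values from what BLAST reports.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of matching bases over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def gc_content(sequence):
    """Percent of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(base in "GC" for base in bases) / len(bases)


# Tiny made-up sequences, only to show the calls.
print(percent_identity("ACGT-A", "ACGAAA"))
print(round(gc_content("GGCCAT"), 1))
```

Note that the GC values quoted above are presumably for whole genomes, whereas the identity values come from a single gene, so the two functions would be applied to different inputs.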

As of right now, the genomic prep I had originally made of THP did not have enough genomic DNA to begin constructing a genomic library. Therefore I am in the process of making a new genomic prep, hopefully with an abundant amount of DNA.

Sign of the apocalypse? Science conference SPAM hybridizes w/ Nigerian advanced fee SPAM.

Normally I do not share SPAM emails. But I have posted occasionally about Journal SPAM and Conference SPAM. So what do I do here? I just received an email that appears to be a hybrid between Conference SPAM and Nigerian advanced fee SPAM. OMG. The merging of two SPAM systems. Too bad the Conference is not about viagra – though since it is about Metagenomics perhaps it somehow got flagged due to studies of the penis or vagina microbiome? In this case I just had to post …

Dear Honored Sir or Madam,

I am Prof. Mohammed Kaoje Abubakar minister of Science for the Republic of Nigeria under former President Alhaji Umaru Musa Yar’Adua. In this role I became in control of large sums of money dedicated to scientific research and the exchange of ideas with researchers from across the globe. However, since the time of the unfortunate death of President Yar’Adua, I have been under intense scrutiny of the new director of the ministry and have been unable to complete my mission. Fortunately, I remain in charge of most of the 200 million dollars US, but the current government will only release the funds in conjunction with scientific activities involving prominent foreign national scientists like yourself.

Therefore, on behalf of the 3rd Nigerian Congress of Metagenomics I am pleased to welcome you to propose a speech on your recent discovery about the genomic basis for the origin and evolution of new functions at the congress by submitting your speech title and CV to us. Meanwhile, we hope you can share your stimulating data, valuable scientific information and influential experiences with other industrial leaders, professionals and research pioneers. You are encouraged to network and explore partnering opportunities.

As a branded Conference of Nigeria Congresses LLC, “Your Think Tank”, NCM continues to expand with magnificent scientific and social programs to maximize your network in a free communication meeting environment.

Activities of NCM 2012

l Keynote Forum – Presentations from Nobel Prize Laureate and Senior Leaders of Renowned Company

l Parallel Forum – 200+ Sessions and Symposiums provide 1000+ speech opportunities for experts from all of the world

l Welcome Banquet – All the participants enjoy the formal buffet dinner with wonderful performance show

l Project Matching Activity – Develop effective platform by free booths supply

l Keymakers Summit – Special Forum for Enterprisers to discuss hot issues face to face

l Exhibition and Poster Zone

The 3rd Nigerian Congress of Metagenomics is initiated for filling the gap between Eastern and West World for metagenomic professionals of free information exchange. In the past decade, NCM has attracted more than 5,000 enthusiastic speakers to communicate on the R & D advances in different therapeutic fields, which have generated great impact on the Chinese Bio/pharmaceutical development, enhanced Research and Development outsourcing, helped regional liaison of big pharma seeking partnership and searching talents, created a lot of opportunities for face-to-face network for multilateral collaboration by sharing both scientific and technological breakthroughs and speed up the process of many challenging drug discovery projects.

For more information PS: http:www.ncm.ng

Warri is a major oil city in Delta State, Nigeria, with a population of over 300,000 people. We look forward to seeing you in Warri for a stimulating and enjoyable conference. Kindest regards,

Prof. Mohammed Kaoje Abubakar8 for the organizing committee.

Phyloseminar: David Pollock 5/30 10am PST “Adaptation, coevolution, & convergence in the context of protein thermodynamics”

Next talk at http://phyloseminar.org/

"Adaptation, coevolution, and convergence in the context of protein thermodynamics"

David Pollock (University of Colorado School of Medicine)

Interactions within and between proteins are a fundamentally important part of how they evolve and adapt. We have been considering how and why proteins adapt, coevolve, and converge, and working to understand these concepts in the context of protein thermostability and function.

We will expand from the previous talk of our collaborator, Dr. Goldstein, and discuss how and why coevolution is and should be detected, and how thermostability affects reconstruction of ancestral functions. Further, we will discuss our work on adaptive redesign in mitochondrial proteins, perhaps the largest known case of an adaptive burst in multiple metabolic proteins. The convergence between ancestral snakes and ancestral acrodont lizards is also perhaps the largest known case of adaptive convergence. We will consider what these examples tell us about the theory of how proteins appear to evolve in the context of nearly neutral versus cases of adaptive change. Further, we will discuss the impact on understanding phylogenetic relationships, and we will also discuss a unified theory of nearly neutral and adaptive evolution in the context of structure and function.

West Coast USA: 10:00 (10:00 AM) on Wednesday, May 30
East Coast USA: 13:00 (01:00 PM) on Wednesday, May 30
UK: 18:00 (06:00 PM) on Wednesday, May 30
France: 19:00 (07:00 PM) on Wednesday, May 30
Japan: 02:00 (02:00 AM) on Thursday, May 31
New Zealand: 05:00 (05:00 AM) on Thursday, May 31
