Advice needed from a future reviewer…

I found myself writing this email to some collaborators, but halfway through realized that it’d be nice to get EVERYBODY’s input. Probably, one of you is going to review my next paper, so how awesome would it be for you to just tell me what you think now, and make both of our lives easier later?

To test whether taxa vary significantly across groups of samples, we first need to filter the OTU table to get rid of OTUs that are not present in most of the samples and/or that do not vary across samples. This must happen for statistical reasons.

As far as I know, there are two ways to do this. One is to remove OTUs that occur in fewer than 25% of the samples (25% is suggested by the QIIME folks). The other is to calculate the variance of each OTU across samples and remove the OTUs that have a variance less than 0.00001 (0.00001 is an arbitrary number thrown out there by the phyloseq developer).

A third option would be to apply both criteria.
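For anyone who wants to see the three options side by side, here is a minimal Python sketch. It is not how QIIME or phyloseq implement their filters; the function name `filter_otus`, the toy table, and the choice to compute (population) variance on per-sample relative abundances rather than raw counts are all assumptions made for illustration.

```python
from statistics import pvariance


def filter_otus(table, min_prevalence=0.25, min_variance=1e-5):
    """Apply the prevalence filter, the variance filter, and both together.

    table: dict mapping OTU id -> list of raw counts, one per sample.
    Returns three sets of OTU ids: kept by prevalence, by variance, by both.
    """
    n_samples = len(next(iter(table.values())))
    # Per-sample totals, used to turn counts into relative abundances
    # (assumption: the variance threshold is meant for proportions).
    totals = [sum(counts[j] for counts in table.values())
              for j in range(n_samples)]

    by_prevalence, by_variance = set(), set()
    for otu, counts in table.items():
        # Fraction of samples in which the OTU is present at all.
        prevalence = sum(1 for c in counts if c > 0) / n_samples
        rel = [c / t if t > 0 else 0.0 for c, t in zip(counts, totals)]
        if prevalence >= min_prevalence:
            by_prevalence.add(otu)
        if pvariance(rel) >= min_variance:
            by_variance.add(otu)
    return by_prevalence, by_variance, by_prevalence & by_variance


# Toy table: 3 OTUs x 8 samples, every sample sums to 100 reads.
otu_table = {
    "OTU_A": [50, 50, 50, 50, 50, 50, 50, 50],  # everywhere, never varies
    "OTU_B": [50, 50, 50, 50, 50, 50, 50, 40],  # everywhere, varies a bit
    "OTU_C": [0, 0, 0, 0, 0, 0, 0, 10],         # present in 1 of 8 samples
}

by_prevalence, by_variance, both = filter_otus(otu_table)
print("kept by prevalence:", sorted(by_prevalence))
print("kept by variance:  ", sorted(by_variance))
print("kept by both:      ", sorted(both))
```

In this toy case each filter alone drops a different OTU, and only OTU_B survives when both are applied, which is the behavior the third option is after.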

My inclination would be to go with the third option, but mostly because I want to limit as much as possible the number of hypothesis tests that we do in order to avoid draconian p-value corrections.
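To make the "fewer tests, gentler correction" point concrete, here is a small Python sketch. I haven't said which correction we'd use, so Bonferroni and Benjamini-Hochberg are just two common choices picked for illustration, and the p-values and test counts are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# The same raw p-value passes with 100 tests but fails with 1000.
p = 0.0004
passes_100 = p < bonferroni_threshold(0.05, 100)
passes_1000 = p < bonferroni_threshold(0.05, 1000)
print(passes_100, passes_1000)

# BH adjustment on a made-up set of four p-values.
bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(bh)
```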

I’m not a big fan of arbitrary thresholds, but they are so frequently required that I’ve made my peace with them. However, if someone can suggest a non-arbitrary threshold, that’d be great.

But mostly, I want to make sure that everyone agrees now on the method that we use so that I only have to do this once. Thoughts?

Installing STAMP on a Mac

Since this was such a huge pain in the ass for so many of us, I figured I’d share what finally worked for me.

First this:
pip install STAMP

then I got an error about matplotlib, so then this:
pip install matplotlib

Now, I type STAMP and it launches.

Of course, I did a hundred other things before I tried this, any number of which may or may not have contributed to the ease of this solution. But, if you’re still trying to get STAMP installed, give this a try.

“pip install STAMP” was a suggestion by Tracy Teal, btw. Can’t wait for her and Titus to get to Davis!

Do we need naming regulations for computer software?

Well, just saw this new paper: BMC Bioinformatics | Full text | Bellerophon: a hybrid method for detecting interchromosomal rearrangements at base pair resolution using next-generation sequencing data.  Seems potentially interesting.  But one part of it struck me as very awkward.  You see, there already is a Bellerophon software program used by many in my field: Bellerophon: a program to detect chimeric sequences in multiple sequence alignments.  Seems like a very bad idea to have a new program with the same name as an existing (and still used) one in a similar general field (DNA sequence analysis).

This leads me to the following question – do we need some sort of naming guidelines or regulations for computer software?  We have all sorts of naming regulations and conventions for genes, for species, for other groups of taxa, and more.  Why not software tools?  But seriously, I don’t think we need such a thing – we just need people to use Google and to do a little searching before they invent / publish a software package in case its name is, well, already used.

QIIME workshop at UC Davis (May 2-4, 2013)

UC Davis will be hosting a 2.5 day QIIME workshop following the SMBE Satellite Meeting on Eukaryotic -Omics, running Thursday afternoon May 2nd through Saturday May 4th. Meeting participants and local Bay Area researchers are encouraged to attend.

Due to space constraints, this workshop will be strictly limited to 32 participants.

Click here to complete the application form (form closes March 22, 2013, and this deadline will not be extended)

Gordon and Betty Moore Foundation hiring fellow for Marine Microbiology program #bioinformatics

Interesting Job Opportunity: Program Fellow, Marine Microbiology Initiative – Gordon and Betty Moore Foundation

See key details of the ad below:

The Bioinformatics Fellow position will be a 1-2 year term. 
The Program Fellow will: 
  • Contribute to developing strategy and implementation plans for the bioinformatics portfolio within the Marine Microbiology Initiative.  The fellow will prepare needs assessment for cyberinfrastructure to support research and discovery by marine microbial ecologists.  The fellow will also coordinate bioinformatics-related activities within MMI. (60% time effort)  
  • Help convene, facilitate and participate in meetings about cyberinfrastructure related to the MMI community to gather and disseminate knowledge, and produce meeting reports or white papers. (30% effort)  
  • Collaborate with MMI Program Officers on grants management related to bioinformatics and data management. (10% effort)  

Key Responsibilities  
The Program Fellow will: 
  • Help develop a strategy and the implementation plans for cyberinfrastructures related to MMI activities. 
  • Communicate with the research community, other funders, commercial vendors, and others to prepare a needs analysis for cyberinfrastructure that includes a description of ongoing or past activities and existing infrastructure. 
  • Convene meetings and workshops in cooperation with grantees and other funders as necessary. 
  • Maintain solid knowledge of the field and key emerging trends.  
  • Contribute effectively on a variety of Program- and Foundation-wide issues beyond the Initiative as required. 
Experience and Education  
The candidate will have: 
  • A Doctorate degree in environmental microbiology, bioinformatics, biology or other relevant field.   
  • Demonstrated knowledge and/or experience with computing environments and sequencing technologies.   
  • Demonstrated experience with using bioinformatics tools.   
Competencies and Attributes  
The ideal candidate also will have:  
  • Good communications skills including demonstrated writing skills.  
  • Demonstrated knowledge of the bioinformatics community and/or existing cyberinfrastructure that supports environmental science.    
  • A desire to promote and work on a complex partnership and multi-stakeholder project to achieve tangible outcomes.  
  • Ability to synthesize diverse points of view to develop solutions. 
  • Demonstrated strong teamwork and interpersonal skills, with ability to develop productive relationships with colleagues, grantees, and stakeholders. Collegial and energetic working style.   
  • Demonstrated comfort with and experience in public speaking and meeting organization/facilitation.    
  • Demonstrated ability and openness to quickly adapt and adjust strategy and approach to changing conditions. 
  • Personal motivation to support the Foundation mission and goals.   
  • Ability and interest in traveling to grantee meetings, site visits, and national/international conferences.   

Attention all metagenomicists: put your pinky in the corner of your mouth & say "1 million dollars"

Already posted this to Twitter and Facebook but had to post here too.  This is wild.  DTRA has announced a $1 million prize for metagenomic analysis: DTRA Algorithm Challenge | Landing Page.  From their page:

The Prize:
As nth generation DNA sequencing technology moves out of the research lab and closer to the diagnostician’s desktop, the process bottleneck will quickly become information processing. The Defense Threat Reduction Agency (DTRA) and the Department of Defense are interested in averting this logjam by fostering the development of new diagnostic algorithms capable of processing sequence data rapidly in a realistic, moderate-to-low resource setting. With this goal in mind, DTRA is sponsoring an algorithm development challenge. 

The Challenge:
Given raw sequence read data from a complex diagnostic sample, what algorithm can most rapidly and accurately characterize the sample, with the least computational overhead?

My instinct is to keep this to myself because, well, I want to win.  But my sharing side of things won out and I am posting here.  Maybe we (i.e., the community) can develop an open, collaborative project to do this?  Just a thought …

Guest post on "CHANCE" ChIP-seq QC and validation software

Guest post by Aaron Diaz from UCSF on a software package called CHANCE which is for ChIP-seq analyses.  Aaron wrote to me telling me about the software and asking if I would consider writing about it on my blog.  Not really the normal topic of my blog but it is open source and published in an open access journal and is genomicy and bioinformaticy in nature.   So I wrote back inviting him to write about it.  Here is his post:

CHANCE: A comprehensive and easy-to-use graphical software for ChIP-seq quality control and validation

Our recent paper presents CHANCE, user-friendly software for ChIP-seq QC and protocol optimization. This graphical software quickly estimates the strength and quality of immunoprecipitations, identifies biases, compares the user’s data with ENCODE’s large collection of published datasets, performs multi-sample normalization, checks against qPCR-validated control regions, and produces publication-ready graphical reports. CHANCE can be downloaded here.

An overview of ChIP-seq: cross-linked chromatin is sheared, enriched for a transcription factor or epigenetic mark of interest using an antibody, purified and …

Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) is a powerful tool for constructing genome-wide maps of epigenetic modifications and transcription factor binding sites. Although this technology enables the study of transcriptional regulation with unprecedented scale and throughput, interpreting the resulting data and knowing when to trust the data can be difficult. Also, when things go wrong it is hard to know where to start when troubleshooting. CHANCE provides a variety of tests to help debug library preparation protocols.

One of the primary uses of CHANCE is to check the strength of the IP. CHANCE produces a summary statement which will give you an estimate of the percentage of the IP reads which map to DNA fragments pulled down by the antibody used for the ChIP. In addition to the size of this signal component within the IP, CHANCE reports the fraction of the genome these signal reads cover, as well as the statistical significance of the genome-wide percentage enrichment relative to control in the form of a q-value (positive false discovery rate). CHANCE has been trained on ChIP-seq experiments from the ENCODE repository by making over 10,000 Input to IP and Input to replicate Input comparisons. The reported q-value is then the fraction of comparisons between Input technical replicates that show an enrichment for signal in one sample relative to the other equal to or greater than that of the user-provided sample. CHANCE identifies insufficient sequencing depth, PCR amplification bias in library preparation, and batch effects.
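The q-value described above can be pictured as an empirical tail fraction. The following Python sketch is only an illustration of that idea under stated assumptions, not CHANCE's actual implementation: the function name `empirical_q_value` and all of the enrichment numbers are hypothetical.

```python
def empirical_q_value(observed_enrichment, null_enrichments):
    """Fraction of null comparisons (Input vs. replicate Input) whose
    enrichment is at least as large as the observed IP enrichment."""
    null = list(null_enrichments)
    if not null:
        raise ValueError("need at least one null comparison")
    at_least_as_large = sum(1 for e in null if e >= observed_enrichment)
    return at_least_as_large / len(null)


# Hypothetical percentage enrichments from five Input-vs-Input comparisons.
null_enrichments = [0.5, 1.0, 2.0, 3.5, 8.0]

# A hypothetical user sample with 3.0% enrichment: two of the five null
# comparisons are at least that large.
q = empirical_q_value(3.0, null_enrichments)
print(q)
```

A small q-value here would mean that replicate Inputs rarely look as enriched as the user's IP, so the observed enrichment is unlikely to be background alone.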

CHANCE identifies biases in sequence content and quality, as well as cell-type and laboratory-dependent biases in read density. Read-density bias reduces the statistical power to distinguish subtle but real enrichment from background noise. CHANCE visualizes base-call quality and nucleotide frequency with heat maps. Furthermore, efficient techniques borrowed from signal processing uncover biases in read density caused by sonication, chemical digestion, and library preparation.

A typical IP enrichment report.

CHANCE cross-validates enrichment with previous ChIP-qPCR results. Experimentalists frequently use ChIP-qPCR to check the enrichment of positive control regions and the background level of negative control regions in their IP DNA relative to Input DNA. It is thus important to verify whether those select regions originally checked with PCR are captured correctly in the sequencing data. CHANCE’s spot-validation tool provides a fast way to perform this verification. CHANCE also compares enrichment in the user’s experiment with enrichment in a large collection of experiments from public ChIP-seq databases.

CHANCE has a user friendly graphical interface.
How CHANCE might be used to provide feedback on protocol optimization.

Three talks, 1.5 days at #ISMB … phylogeny, phylogenomics, open science and more

Gave three talks in 1.5 days here in Long Beach as part of the satellite meetings associated with the “Intelligent Systems for Molecular Biology” (ISMB) 2012 Conference. I will write more about the meeting and the craziness of giving three very different talks in 1.5 days. But for now I wanted to at least get my talks posted here since I posted the slides to slideshare and recorded the audio in synch with the slides and posted these “slideshows” to YouTube. Here are the talks below:

Talk 1 for the “Bioinformatics Open Source Conference” BOSC2012.  Was asked to talk about Open Science … so … I did …

Slideshow with audio:

Talk 2 for the Student Council Symposium SCS2012. Sort of supposed to be a career guidance discussion so I geared my talk along the lines of “lessons learned” …

Slideshow with audio:

Talk 3 for the “Automated function prediction” AFP2012 satellite meeting.  I decided to talk about phylogenetic and phylogenomic approaches to functional prediction …

Slideshow with audio:

[View the story “Jonathan Eisen @phylogenomics talks at #ISMB Satellite Meetings” on Storify]

Summary of responses to question about metrics for density in phylogenetic trees

I posted a question to Twitter and Facebook about metrics for assessing density in a phylogenetic tree. Here is a “Storification” of the responses. Thanks for the help all.
Any other suggestions welcome in comments …

[View the story “Metrics to quantify density of taxa/sampling in a phylogenetic tree” on Storify]

Draft post cleanup #11: Tree Hugging

Yet another post in my “draft blog post cleanup” series.  Here is #11 from September.

Just a quick one.  In August a nice review paper came out on phylogenetic analysis software: Learning to Become a Tree Hugger | The Scientist.  By Amy Maxmen, it is “A guide to free software for constructing and assessing species relationships”.  Definitely worth checking out.

Among the key links & tools discussed: