Advice needed from a future reviewer…

I found myself writing this email to some collaborators, but halfway through realized that it’d be nice to get EVERYBODY’s input. Probably, one of you is going to review my next paper, so how awesome would it be for you to just tell me what you think now, and make both of our lives easier later?

To test whether taxa vary significantly across groups of samples, we first need to filter the OTU table to get rid of OTUs that are not present in most of the samples and/or that do not vary across samples. This must happen for statistical reasons.

As far as I know, there are two ways to do this. One is to remove OTUs that occur in fewer than 25% of the samples (25% is suggested by the QIIME folks). The other is to calculate the variance of each OTU across samples and remove the OTUs that have a variance less than 0.00001 (0.00001 is an arbitrary number thrown out there by the phyloseq developer).

A third option would be to apply both criteria.
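To make the third option concrete, here is a minimal sketch of applying both criteria at once. It assumes the OTU table is a plain dictionary of relative abundances per sample; the function name `filter_otus`, its parameters, and the toy numbers are made up for illustration and are not the QIIME or phyloseq implementation.

```python
from statistics import variance


def filter_otus(table, min_prevalence=0.25, min_variance=1e-5):
    """Keep OTUs that pass BOTH the prevalence and the variance criterion.

    table: dict mapping OTU id -> list of relative abundances, one per sample
    (an assumed layout, chosen only to keep the sketch self-contained).
    """
    kept = {}
    for otu, abundances in table.items():
        # Prevalence: fraction of samples in which the OTU is observed at all.
        prevalence = sum(1 for a in abundances if a > 0) / len(abundances)
        # Sample variance (n - 1 denominator) of the OTU across samples.
        var = variance(abundances)
        if prevalence >= min_prevalence and var >= min_variance:
            kept[otu] = abundances
    return kept


# Toy table: 3 OTUs x 8 samples (hypothetical relative abundances).
toy = {
    "OTU_1": [0.10, 0.25, 0.05, 0.30, 0.15, 0.20, 0.12, 0.08],  # common and variable
    "OTU_2": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.40],         # present in 1 of 8 samples
    "OTU_3": [0.001] * 8,                                        # present everywhere, but flat
}

filtered = filter_otus(toy)
print(sorted(filtered))
```

In this toy table, OTU_2 fails the prevalence filter and OTU_3 fails the variance filter, so only OTU_1 goes on to hypothesis testing. Setting either threshold to 0 turns that criterion off, which makes it easy to compare the first two options against the combined one.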

My inclination would be to go with the third option, but mostly because I want to limit as much as possible the number of hypothesis tests that we do in order to avoid draconian p-value corrections.
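To put rough numbers on why fewer tests help, here is a small sketch of two common corrections, Bonferroni and Benjamini-Hochberg. I picked these only as illustrations, not because either one is settled for this analysis, and the p-values and test counts are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the test is rejected at FDR alpha."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# The same hypothetical p-value, tested among 5000 OTUs versus 100 OTUs.
p = 0.0004
for n_tests in (5000, 100):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(n_tests, "tests: significant =", p < threshold)

# Benjamini-Hochberg on a handful of hypothetical p-values.
print(benjamini_hochberg([0.001, 0.5, 0.04]))
```

With these made-up numbers, the same p-value survives a Bonferroni correction when 100 OTUs are tested but not when 5000 are, which is the whole motivation for filtering hard before testing.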

I’m not a big fan of arbitrary thresholds, but they are so frequently required that I’ve made my peace with them. However, if someone can suggest a non-arbitrary threshold, that’d be great.

But mostly, I want to make sure that everyone agrees now on the method that we use so that I only have to do this once. Thoughts?

Installing STAMP on a Mac

Since this was such a huge pain in the ass for so many of us, I figured I’d share what finally worked for me.

First this:
pip install STAMP

then I got an error about matplotlib, so then this:
pip install matplotlib

Now, I type STAMP and it launches.

Of course, I did a hundred other things before I tried this, any number of which may or may not have contributed to the ease of this solution. But, if you’re still trying to get STAMP installed, give this a try.

“pip install STAMP” was a suggestion by Tracy Teal, btw. Can’t wait for her and Titus to get to Davis!

What the fungi do I do with my ITS library? (Part 2)

Originally posted on May 22, 2014

Previously, I expressed some concern about size variation in my environmental fungal ITS PCR libraries. I’m still concerned about that, but I have an additional concern. The ITS region can’t be aligned, and I’m partial to phylogenetic approaches to pretty much everything. So maybe ITS is not for me?

So, I asked Twitter again…

In summary, I don’t think that I can use ITS given the size variation that I see, and I’m not sure that I want to, given the fact that you cannot align it to do phylogeny-based analyses.

28S (or LSU) is a reasonable alternative to ITS that has two big downsides: 1) the reference database is much smaller than the ITS reference database and 2) it does not provide the fine-scale taxonomic resolution that ITS does.

Rachel Adams referred me to Amend et al., in which they use both. I’ll have to look into this approach…

What the fungi do I do with my ITS library?

Originally posted on May 21, 2014

It’s been about 8 years since I started working on my first 16S rRNA PCR survey (of Drosophila gut microbes). At that time, I was occasionally asked, “what about Archaea or what about microbial Eukaryotes?” Then, and ever since, my reply has been that it’s hard enough to get a handle on what’s going on with the bacteria – I don’t need to make my life more challenging by broadening my scope.

But, finally, this month, I’m making my life more challenging. As part of my new Seagrass Microbiome Project, I’ve decided to tackle the fungi. As far as I can tell, ITS is the “barcoding” marker of choice for fungal types. For many reasons, it’s best to follow the herd when doing this sort of thing: 1) someone else has already designed, tested, and published results with these primers, 2) there is a reasonably large database of ITS sequences available to compare my sequences to, and 3) I lack the interest and personnel to explore an alternative approach.

So, I just plunged right in. At first, I tried some new primers designed by Nick Bokulich, but he warned me that they were “finicky” and he was correct. I got no amplification with my seagrass samples, and the positive control I had only worked about half the time. I know some other fungi people, well, I know Jason Stajich (@hyphaltip), so I asked him which primers I should use, and I decided to go with the primer set used in a cool paper by Noah Fierer’s lab, in which they looked at fungi in rooftop gardens in New York City.

Those worked, and a few days ago I got word that my sequencing run was in. It looks like crap. We typically get about 12 million sequences from our MiSeq runs, but this time, I only got 4 million. I was also told that the reverse reads looked much better than the forward reads.

So, now, in addition to working with a new “barcode,” I have to troubleshoot a crappy sequencing run. In many ways, it’s nice to have undergrads and a technician in the lab who do all of my lab work for me these days, but it sucks when it’s time to troubleshoot because I’m so far removed from the bench that I have no idea what’s going on anymore.

So, the first thing I asked for was the Bioanalyzer trace that’s always run before the library goes on the machine. It looks like this:

Bioanalyzer trace for my first fungal ITS MiSeq run

I had been told that there was size variation. I had even seen some of the PCR gels. But still, this is not what I expected to see. Upon seeing this, I am concerned about two things. 1) If there is strong preferential amplification of smaller DNA molecules during the bridge PCR on the flow cell, then will I even see DNA from those larger peaks? 2) With our 300bp reads, for sure the amplicons in the peaks <400bp will have overlapping forward and reverse reads, but for sure the 676bp amplicons will not. What effect will these two things have on my analysis? How do I accommodate this size variation? One of the reasons to follow the herd with these methods is that other people have probably already encountered and dealt with exactly this issue, so I turned to Twitter…

There are some great resources suggested here. I know what I’ll be reading this weekend…
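For the record, the overlap arithmetic behind my second concern can be sketched in a few lines. This assumes the length passed in is the stretch of DNA actually sequenced between the two read start points; fragment sizes on a Bioanalyzer trace of a finished library also include adapter sequence, so real inserts would be somewhat shorter. The function name `read_overlap` is made up for illustration.

```python
def read_overlap(amplicon_len, read_len=300):
    """Bases covered by both reads of a pair.

    A negative value is the size of the unsequenced gap between the reads.
    The overlap can never exceed the amplicon itself, hence the min().
    """
    return min(amplicon_len, 2 * read_len - amplicon_len)


for size in (400, 676):
    overlap = read_overlap(size)
    if overlap > 0:
        print(size, "bp amplicon: reads overlap by", overlap, "bp")
    else:
        print(size, "bp amplicon: gap of", -overlap, "bp between reads")
```

Under that assumption, any amplicon longer than 600bp leaves a gap between the forward and reverse reads, so those pairs cannot be merged and would have to be handled some other way.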


Valentine’s Day Seminar at #UCDavis: Herbert M Sauro: Reproducibility in Systems and Synthetic Biology: Issues at the bench and the computer.

The Genome Center Biological Networks Seminars Series present:

Reproducibility in Systems and Synthetic Biology: Issues at the bench and the computer.

Speaker: Herbert M Sauro
Associate Professor
Department of Bioengineering

University of Washington
Date: Friday, February 14th, 2014, 10am – 11am
Location:  1005 GBSF

Reproducibility has been and is becoming more of an issue as the research we do becomes more complex. In the work I do there are two areas that warrant concern. The first is that the computational experiments we publish as a community are rarely if ever reproducible. Secondly, in synthetic biology, where we design new organisms which are also published, we again are confronted with the fact that the bulk of published synthetic biology designs cannot be recreated without recourse to the original constructs themselves. Sometimes even then the reported experiments cannot be reproduced. Reproducibility is at the heart of the scientific method and it damages science, particularly in the eyes of the general public, if the work we do cannot be easily reproduced. In addition, there are cost concerns when it can take months of labor to recreate work already done. The good news is that, at least in computational science, the reason for the lack of reproducibility is due almost entirely to human error. This is likely to be true in experimental science as well. Human error can in theory be easily corrected. In this talk I will discuss some of the efforts going on in my lab and others in relation to reproducibility in computational modeling and the design and implementation of synthetic organisms.

NASA Data Bridge Workshop

I’m at Caltech right now, at a workshop designed to figure out how to use citizen scientists to bridge the gap between NASA Earth Science data and societal needs (policy, management, forecasting, etc.). I’ll write up a more detailed summary later, but I wanted to post the bios of the participants. It’s a pretty awesome crowd.

NASA Data Bridgemakers Bios

Citizen (Funded) Science

As evidenced by the buzz around the Citizen Science session at ASM yesterday, microbial ecologists are increasingly looking for ways to engage “Citizen Scientists” to participate in large-scale projects, often by crowd-sourcing sample collection. This concept is not new; in fact, citizen scientists have been counting birds every year for over 100 years!

A recent price drop in high-throughput sequencing technology has enabled the ginormous scale of current citizen science projects. Anyone who is interested can participate by swabbing their homes or their poop! This same price drop also enables a different sort of citizen scientist. This sort of citizen scientist is not tied to any pre-existing project, but is free to imagine his own. If he can convince the right person that his project is interesting, he can fund it himself with relatively little cash.

Dan Prater is exactly this sort of citizen scientist! He was curious about the microbial succession taking place in the compost tea that he was brewing on his family’s organic farm in Indiana. So, he asked Jonathan Eisen to help him figure it out! Dan collected a time series of samples from two different batches of compost tea. He shipped the samples to UC Davis, and I prepared 16S rRNA PCR libraries from them. We went from sample collection, through a MiSeq run, to a poster at ASM (with data) in a month! <sound of my horn tooting>

If you are at ASM, you can check out this poster tomorrow, Tuesday, May 21 from 1:00-2:45pm! It’s poster #2402, entitled Phylogenetic Analysis of an Actively Aerated Compost Tea.

UC Davis’ 2nd Annual BioBlitz will be sampling MICROBES!

The first-ever MicroBioBlitz will be held tomorrow, April 27th, from 9am-2pm! Join the Eisen Lab and the UC Davis chapter of SEEDS (the Ecological Society of America’s Undergraduate Ecology Club) at the Putah Creek Riparian Reserve for their second annual BioBlitz. The event is open to the public, and SEEDS just finished putting out some traps, so there could be lots of fun critters to see tomorrow!

I just chatted with the BioBlitz organizers, and they are super-excited to have us there to recruit the BioBlitz participants. As they catch frogs and butterflies, and identify flowers and trees, they will all be armed with a microbial sampling kit! Microbial samples will be sent to Jack Gilbert for sequencing as part of the Earth Microbiome Project!

Bring friends, family, coworkers, or anyone else you might know!

Here is the link to iNaturalist, which is the site they will be using to upload all of their species accounts. Come out and help the UCD SEEDS chapter win the BIOBLITZ contest this year!!

Hmmm… wonder if our microbial species will count for the contest???

Here’s the official BIOBLITZ FLYER 

and a BioBlitz Map

Annotation Databases

MG-RAST allows you to view the annotation of your data using several different annotation pipelines/databases. So, we had a discussion about them. Each database/tool was tackled by a different person:

1. GenBank/RefSeq – Joe
2. SEED/Subsystems – Jenna
3. COG/NOG/eggNOG – Tyler
4. KEGG/KO – Megan
5. SwissProt/trEMBL – Kate
6. IMG – Guillaume
7. PATRIC – Sima
8. GO – David

I’m hoping that everyone will be so kind as to post a summary of their database here, as a reply to this blog post.

We kept coming back to the point that which database is right for you depends on what biological question you are hoping to address. As a test dataset, we are currently using samples of a microbial mat from Lake Fryxell in Antarctica. Kate, Tyler, and Megan will provide us with a few interesting questions that we might be able to address using their data, and then we will all spend some time playing around with the annotation results from the different databases. How does the biological interpretation of the data change with respect to the annotation database used? Next week, we will discuss this.