I found myself writing this email to some collaborators, but halfway through I realized that it’d be nice to get EVERYBODY’s input. Probably, one of you is going to review my next paper, so how awesome would it be for you to just tell me what you think now and make both of our lives easier later?
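To test whether taxa vary significantly across groups of samples, we first need to filter the OTU table to get rid of OTUs that are not present in most of the samples and/or that do not vary across samples. This must happen for statistical reasons: an OTU that is almost always absent or almost constant gives a test very little to work with, yet it still counts toward the total number of tests.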
As far as I know, there are two ways to do this. One is to remove OTUs that occur in fewer than 25% of the samples (25% is suggested by the QIIME folks). The other is to calculate the variance of each OTU across samples and remove the OTUs that have a variance less than 0.00001 (0.00001 is an arbitrary number thrown out there by the phyloseq developer).
A third option would be to apply both criteria.
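To make the three options concrete, here is a rough sketch in Python with NumPy. It is not the QIIME or phyloseq implementation. The function name `filter_otus`, its parameters, and the toy counts are all made up for illustration, and I am assuming that the variance is computed on relative abundances (each sample's counts divided by that sample's total) using the sample variance.

```python
import numpy as np

def filter_otus(counts, min_prevalence=0.25, min_variance=1e-5):
    """Return a boolean mask of OTUs to keep (hypothetical helper).

    counts: 2-D array of raw counts, rows = OTUs, columns = samples.
    """
    counts = np.asarray(counts, dtype=float)

    # Assumption: variance is taken on relative abundances, so each
    # sample (column) is divided by its total. Empty samples stay at 0.
    totals = counts.sum(axis=0, keepdims=True)
    rel = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # Criterion 1: fraction of samples in which the OTU is present.
    prevalence = (counts > 0).mean(axis=1)

    # Criterion 2: sample variance (ddof=1) of each OTU across samples.
    variance = rel.var(axis=1, ddof=1)

    # Option three: an OTU must pass both criteria to be kept.
    return (prevalence >= min_prevalence) & (variance >= min_variance)

# Toy table: 4 OTUs (rows) by 5 samples (columns), invented numbers.
example_counts = [
    [10,  0,  0,  0,  0],   # present in 1 of 5 samples
    [ 5,  5,  5,  5,  5],   # present everywhere but constant
    [ 0, 20, 40, 10, 30],
    [85, 75, 55, 85, 65],
]
keep = filter_otus(example_counts)
filtered = np.asarray(example_counts)[keep]
```

Setting `min_variance=0.0` gives the prevalence-only filter, and setting `min_prevalence=0.0` gives the variance-only filter, so the same sketch covers all three options.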
My inclination would be to go with the third option, but mostly because I want to limit as much as possible the number of hypothesis tests that we do in order to avoid draconian p-value corrections.
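To show why the number of tests matters, here is a small sketch of two common corrections. I haven't settled on a correction method above, so Bonferroni and Benjamini-Hochberg are just my picks for illustration, and the function names and the OTU counts in the usage lines are hypothetical.

```python
import numpy as np

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Hypothetical table sizes: 5000 OTUs before filtering, 500 after.
before = bonferroni_threshold(0.05, 5000)
after = bonferroni_threshold(0.05, 500)
```

With these invented sizes, each remaining OTU is held to a threshold ten times less strict after filtering, which is the whole appeal of filtering hard before testing.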
I’m not a big fan of arbitrary thresholds, but they are so frequently required that I’ve made my peace with them. However, if someone can suggest a non-arbitrary threshold, that’d be great.
But mostly, I want to make sure that everyone agrees now on the method that we use so that I only have to do this once. Thoughts?