Data, Data, Data!

It’s been a while since our last post, but after many technical challenges and more than our share of bad luck, we finally have data and are in the process of analyzing it!

I’ve worked in other labs before this one and I always forget how much fun it is to finally take a step back and analyze the data you’ve been working so hard to obtain. I thought I’d share some of what we’re up to with you.

Demultiplexing and Assembly

The data generated by Illumina sequencing takes the form of many thousands of short reads (about 600 bp each). The sequencer also performs some preliminary error-checking and clean-up on the reads so the sequence is easier to work with. Since we pooled our samples into one well, our first step was to separate each set of reads by barcode. This is also called demultiplexing the data.
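To make the idea concrete, here is a minimal sketch of demultiplexing in Python. It is not the software we actually ran. It assumes each read arrives as a (barcode, sequence) pair and that barcodes must match exactly; the function name, sample names, and sequences are all made up for illustration. Reads whose barcode is not in the sample sheet land in an "unassigned" bin rather than being lost.

```python
from collections import defaultdict


def demultiplex(records, sample_by_barcode):
    """Group read sequences by the sample their barcode belongs to.

    records: iterable of (barcode, sequence) pairs (hypothetical input format)
    sample_by_barcode: dict mapping barcode sequence -> sample name
    """
    bins = defaultdict(list)
    for barcode, sequence in records:
        # Exact-match lookup; unknown barcodes go to an "unassigned" bin.
        sample = sample_by_barcode.get(barcode, "unassigned")
        bins[sample].append(sequence)
    return dict(bins)


# Made-up barcodes and reads, for illustration only.
sample_by_barcode = {"ACGT": "sample_A", "TTAG": "sample_B"}
records = [
    ("ACGT", "GGCATT"),
    ("TTAG", "CCGTAA"),
    ("GGGG", "ATATAT"),  # barcode not in the sample sheet
    ("ACGT", "TTTGCA"),
]
bins = demultiplex(records, sample_by_barcode)
```

Keeping the unassigned reads around is a deliberate choice in this sketch: it makes it possible to go back and check whether reads were binned under an unexpected barcode.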

Unfortunately, none of my reads showed up in the demultiplexed data from the sequencer. When we went back and re-ran the demultiplexer on the raw, unprocessed data, we found that the THU reads were present but had been thrown out as errors, because they carried the barcode previously assigned to Amanda (whose library was not being sequenced in this particular run). We concluded that this was most likely due to a mix-up during the library preparation process, and we later verified that the reads were truly THU using a whole-genome BLAST.

After demultiplexing, we used an assembly pipeline called the A5 pipeline (a piece of software developed in the Eisen lab) to assemble the reads into contigs and then scaffolds. Contigs are small sections of DNA that have been compiled by aligning reads next to each other using overlapping regions as a guide. Scaffolds are even larger aligned sections of DNA that are made up of contigs. (Nature Education has a helpful diagram here: http://www.nature.com/scitable/content/anatomy-of-whole-genome-assembly-20429)
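The "overlapping regions as a guide" idea can be shown with a toy Python sketch that merges two reads when the end of one matches the start of the other. This is only an illustration of the concept, not how the A5 pipeline works internally. It assumes error-free reads and exact overlaps, and the function names, the minimum overlap of 3 bases, and the sequences are invented for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one longer sequence if they overlap, else return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


# Made-up reads: the last four bases of the first match the first four of the second.
contig = merge_reads("ATGCGTAC", "GTACCTTA")
no_overlap = merge_reads("AAAA", "CCCC")
```

Real assemblers also have to cope with sequencing errors, repeats, and reads from both strands, which is why we rely on a dedicated pipeline rather than anything this simple.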

Annotation

Once the draft genome was assembled into scaffolds, we submitted the scaffold data to RAST, a genome annotator. Genome annotation software such as RAST searches submitted DNA sequences to identify known genes and gene families. It also has a tool for comparing genomes to each other. Below is a summary of the RAST annotation of my organism.

I still have a lot of analyzing left to do, but it’s wonderful to finally be at this step!

Author: Hannah Holland-Moritz

Hannah Holland-Moritz is a graduate student working in Noah Fierer’s lab. She graduated from UC Davis in June 2014 with a major in Biochemistry and Molecular Biology and a minor in Bioinformatics, and most recently spent a gap year working in Jonathan Eisen’s lab on the microbiome of seagrasses. Interested in Evolution, Ecology, Bioinformatics and all things microbial, she plans to pursue a career in research.

2 thoughts on “Data, Data, Data!”

Nice post Hannah. Just a couple of technical comments. Illumina produces millions of ~160 bp reads on the MiSeq we used. The fragments containing the adapters range from 300-600 bp.
