Spring 2014
BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin Susquehanna University
E-Mail:

Overview

In 1962, a surface trash fire ignited an anthracite coal seam in an abandoned strip mine in Centralia, Pennsylvania. Repeated efforts to extinguish the fire failed, and in 1984 Congress responded to the resulting high carbon monoxide levels and frequent land collapses by allocating more than $42 million for relocation efforts. Most of the residents have long since moved, and their homes have been demolished, leaving behind a ghost town where a coal mining community once thrived (Fig. 1).

Figure 1: Above: Centralia, PA prior to the evacuation in 1984. The town had over 1800 residents, several businesses and churches. Right: Old Route 61 through Centralia (taken in 1997) showing steam, rich in carbon monoxide, venting upward through cracks caused by land collapses.

As a result of this mine fire, surface soil temperatures in affected areas regularly exceed 60°C and soils surrounding the vents are often rich in combustion products such as sulfur and nitrogen that microbial communities can use and transform as a part of their energy-generating processes.

In this case study, you will use information in papers that describe typical geothermal soils and their microbial communities to hypothesize a single bacterial genus that you would expect to find living in Centralia’s fire-affected soils. You will use metagenomic analysis to test your hypothesis and then make a presentation that reports your findings and predicts the types of impacts that members of your genus might be having on the Centralia ecosystem.

Goals

As a result of participation in these activities, students will be able to:
1.  Explain each step in the generation and analysis of Next Gen metagenomic 16S rRNA sequence data.
2.  Discuss the basic biology assumptions that underlie sequence analysis (e.g. evolution, structure and function, conservation = function).
3.  Evaluate the strengths and weaknesses of the methods employed in Next Gen sequencing, including the impact that data quality has on bioinformatics analysis.
4.  Choose and justify the appropriate methods for a specific Next Gen sequencing application.
5.  Apply Next Gen sequencing methodologies to solve their own research questions.

Evaluation

The final evaluation of this project will be based on the successful completion of Team Application Activities and the Final Presentation.

Figure 2: Steam from “Anthracite Smokers” in Centralia, PA carries dissolved combustion products, such as nitrogen and sulfur, to the surface through soil fractures. As the steam rises, it cools and precipitates chemicals into the surrounding soils, where they can be utilized and transformed by nitrogen- and sulfur-cycling bacterial communities.

Materials

Recommended Readings:
A Metagenomics Primer
Computer Resources:
Quantitative Insights into Microbial Ecology (QIIME) for Macs.
Students may also use the Windows version of QIIME, but they must also install VirtualBox to run the program.
Installation instructions for both platforms can be found at the QIIME website at: http://qiime.org/
Metagenomics Sequence Resources:
Centralia Metagenomics files Cen95 and Cen125 are available through the GCAT-SEEK consortium at http://lycofs01.lycoming.edu/~gcat-seek/index.html

Team Application Activities:

Activity #1
Students will learn about the history and biogeochemistry of the Centralia Mine Fire environment and will take the GCAT-SEEK pre-test.
Activity #2
Students will work in teams in order to familiarize themselves with metagenomics, LINUX and QIIME, and will propose hypotheses regarding the types of microbial species they expect to see in thermophilic versus mesophilic soils in Centralia.
Activity #3
Students will use QIIME to test their hypotheses.
Activity #4
Students will complete their QIIME analysis and begin to prepare their presentations.
Final Presentation
Each student team will present their metagenomic findings.

Team Application Activity #1: An Introduction to Next Generation Sequencing, Metagenomic Analysis, LINUX and QIIME

Next Generation Sequencing and Pyrosequencing

“Next generation (Next Gen) sequencing” is a term that encompasses a variety of DNA sequencing technologies, all of which have a common core approach: they use DNA polymerase to generate thousands or millions of relatively short (compared to traditional sequencing technologies) sequences of a DNA template concurrently. Thus, these sequencing technologies are often referred to as being ‘massively parallel’. They then differ in the manner in which they determine when (and which) base is added to the replicating DNA (that is, in how they actually “read” the sequence). For example, Ion Torrent sequencing uses the tiny pH change that happens each time a new phosphodiester bond is created to determine whether or not a particular base was added.

The data that we will be using was generated using a technology called ‘pyrosequencing’ (Figure 3). In this technology, a library is first made either by fragmenting genomic DNA into smaller pieces (300-800 base pairs) or, as in our case, by amplifying a specific gene from an environment using PCR, so that all of the copies of that gene serve as the template DNA. Short adaptors (shown as A and B in the figure) are then ligated onto the ends of the template fragments. The first adaptor is used to attach the DNA fragments onto streptavidin-coated beads. The second adaptor is used for amplification and sequencing of the fragments. The DNA library is then treated to make it single-stranded, and immobilized onto the beads at a dilution that ensures that each bead contains only a single, unique DNA fragment (top row, left hand side).

The bead-bound library is emulsified with PCR reagents in a water-in-oil mixture. Each bead is captured within its own microreactor where PCR amplification occurs. This results in approximately 10 million copies of a single sequence (top row) attached to each bead.

The beads, containing the amplified, single-stranded template DNA library, are then added to individual wells of a PicoTiterPlate (center row) that contains the DNA polymerase, sulfurylase and luciferase enzymes. The latter two enzymes will make a flash of light if DNA polymerase successfully adds a base to the growing end of a daughter DNA strand during the sequencing reactions (bottom row).

The loaded PicoTiterPlate device is placed into the sequencer, which floods all of the wells (each well, as you remember, has a different DNA fragment in it) with sequencing reagents containing buffers, primers and one of the bases. Let’s say G is added to all of the wells first. Since each well has a unique piece of DNA, the G will be complementary to the first base of the template DNA in some (but not all) of the wells. Thus, DNA polymerase will only add it to the growing daughter strand in those (complementary) wells. Multiple G’s will be added at this time if the template strand has more than one C in a row (e.g. CCC in positions 1, 2 and 3). Addition of one or more nucleotide(s) generates a flash of light, as previously described. The signal strength of the flash is proportional to the number of nucleotides added, so a GGG sequence will have a light signal three times as bright as a single G. If the base that is added is not complementary to the template strand, no light will be generated.

When an entire plate is flooded with the sequencing reagents in this manner, some of the wells will glow and some will not. The sequencer can detect the light flashes, and will record which of the wells incorporated a G. The wells are then washed, the next base is added (either A, T or C) and the whole process is repeated, sequentially. After each addition, the sequencer ‘reads’ which wells incorporated the new base. This process is then repeated many times, ultimately generating short (up to several hundred base pair) sequences of all of the unique fragments in all of the wells at the same time…massively parallel, indeed!
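The flow-by-flow logic described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the signal is idealized (exactly one unit of light per base added, with no noise), the flow order GATC and the template CCCTAG are made up for illustration, and strand orientation is ignored. The function names are hypothetical and are not part of any sequencer software.

```python
# Toy model of one well during pyrosequencing.
# Assumptions: noise-free light signal, made-up flow order and template.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def flowgram(template, flow_order="GATC", max_cycles=50):
    """Return a list of (flowed_base, light_signal) pairs for one well."""
    position = 0
    flows = []
    for _ in range(max_cycles):
        for base in flow_order:
            if position >= len(template):
                return flows  # the whole template has been copied
            signal = 0
            # The polymerase keeps adding the flowed base for as long as it
            # pairs with the next template base (this handles runs like CCC).
            while (position < len(template)
                   and COMPLEMENT[template[position]] == base):
                signal += 1
                position += 1
            flows.append((base, signal))
    return flows


def call_bases(flows):
    """Rebuild the daughter strand: a signal of 3 means the base was added 3 times."""
    return "".join(base * signal for base, signal in flows)


flows = flowgram("CCCTAG")
read = call_bases(flows)
print(flows)
print(read)
```

In this sketch the first flow (G) meets three C's in a row, so it gives a signal of 3, just as the GGG example in the text describes. A flow whose base does not pair with the template records a signal of 0, which corresponds to a well that stays dark.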

Metagenomic Analysis of Bacterial 16S rRNA genes (Much of this content was pirated shamelessly, but with permission, from Regina Lamendella, Juniata College)

The term ‘metagenomics’ was originally coined by Jo Handelsman in the late 1990s and is currently defined as “the application of modern genomics techniques to the study of microbial communities directly in their natural environments”. Metagenomic analyses allow microbiologists to tap into the vast, uncultured/unculturable microbial diversity of our world. Recently, massively parallel next generation sequencing has become cost-effective and informative, allowing taxonomic profiling of microbial communities, and leading to consortia such as the Earth Microbiome Project, the Hospital Microbiome Project, the Human Microbiome Project and others that are tasked with uncovering the distribution of microorganisms within us and in our world.

The rRNA operon contains genes that encode structural and functional portions of the ribosome (Figure 4). This operon contains both highly conserved sequences that can be used to design ‘universal’ and taxon-specific PCR primers and highly variable regions that simultaneously allow researchers to distinguish between taxa. Within this operon, the small subunit RNA (16S rRNA) gene has been particularly valuable for phylogenetic analysis. A vast amount of sequence data for this gene exists in a variety of international databases, and this data can be used to design phylogenetically conserved probes that target both individual and closely related groups of microorganisms without cultivation. Some of the most well curated databases of 16S rRNA sequences include Greengenes, the Ribosomal Database Project, and ARB-Silva (see references section for links to these databases).
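To make the idea of conserved versus variable regions concrete, here is a minimal Python sketch. The three aligned sequences are invented for illustration (they are not real 16S rRNA data), and the taxon names and the function name are hypothetical. The sketch scores each alignment column by the fraction of sequences that share the most common base.

```python
from collections import Counter

# Made-up aligned fragments, NOT real 16S rRNA sequences.
alignment = {
    "taxon_A": "ACGTACGGTTAC",
    "taxon_B": "ACGTACCATTAC",
    "taxon_C": "ACGTACTCTTAC",
}


def column_conservation(sequences):
    """Return, for each column, the fraction of sequences sharing the most common base."""
    sequences = list(sequences)
    scores = []
    for column in zip(*sequences):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(sequences))
    return scores


scores = column_conservation(alignment.values())
conserved = [i for i, score in enumerate(scores) if score == 1.0]
variable = [i for i, score in enumerate(scores) if score < 1.0]

print("Conserved columns:", conserved)
print("Variable columns:", variable)
```

In this toy alignment the fully conserved columns flank a short variable stretch. That is the layout that makes a gene useful here: a primer can bind the conserved flanks in every taxon, while the variable stretch in between tells the taxa apart.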

Figure 4. Structure of the rRNA operon in bacteria. Figure from Principles of Biochemistry 4th Edition Pearson Prentice Hall Inc. 2006.

In preparation for this case study, soil was collected from three boreholes in Centralia, PA (37°C, 52°C and 60°C), and genomic DNA was directly isolated from the samples using the MoBio Powersoil Kit. PCR with universal bacterial 16S rRNA primers was then used to make copies of all of the bacterial 16S rRNA genes in each of these samples. These PCR products were then used as the template for Roche 454 pyrosequencing at the Penn State University genomics lab. You will be using this data to test hypotheses regarding the types of bacteria that live in the hot soils overlying the Centralia, PA mine fire. But first, you must learn a bit about the program that you will be using to perform the analyses.

An Introduction to MacQIIME

Quantitative Insights into Microbial Ecology (QIIME) is an open source pipeline that runs in a LINUX environment. It can be used to process next generation sequencing data in a variety of ways that range from making sure that all of your sequences are of high enough quality to be used (quality trimming) to performing a whole suite of phylogenetic and statistical analyses on the quality-trimmed data. We will be utilizing many of these functions in this case study, but first you must get used to working in the LINUX environment using the Mac Terminal, which is part of the operating system on all Macs. MacQIIME has been pre-loaded, so it should be ready to go. The Linux and QIIME tutorials that follow are largely the work of Dr. Regina Lamendella at Juniata College. I have tweaked them a bit to be appropriate for our operating system and case study.
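The idea behind quality trimming can be illustrated with a short Python sketch. This is not QIIME's actual algorithm or command. It is a simplified stand-in that cuts a read at the first base whose quality score drops below a cutoff. The function name, the cutoff of 25, and the example read and scores are all assumptions chosen for illustration.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=25):
    """Keep the read up to (not including) the first base below min_quality.

    Illustrative only: real pipelines use more elaborate rules.
    """
    if len(sequence) != len(qualities):
        raise ValueError("Need exactly one quality score per base")
    for index, quality in enumerate(qualities):
        if quality < min_quality:
            return sequence[:index], qualities[:index]
    return sequence, qualities


# Made-up read whose quality falls off toward the end, as often happens.
read = "ACGTACGT"
scores = [38, 37, 36, 35, 30, 22, 18, 10]

trimmed_read, trimmed_scores = trim_low_quality_tail(read, scores)
print(trimmed_read)
print(trimmed_scores)
```

Only the bases that pass the cutoff go on to the later phylogenetic and statistical steps, which is why poor data quality can change the results of every analysis that follows.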

Unix/Linux Tutorial

Linux is an open-source, Unix-like operating system. It allows the user considerable flexibility and control over the computer through command line interaction. Many bioinformatics pipelines are built for the Unix/Linux environment; therefore, it is a good idea to become familiar with Linux basics before beginning bioinformatics.

Every desktop computer uses an operating system. The most popular operating systems in use today are Windows, Mac OS, and UNIX. Linux is an operating system very much like UNIX, and it has become very popular over the last several years. Operating systems are computer programs. An operating system is the first piece of software that the computer executes when you turn the machine on. The operating system loads itself into memory and begins managing the resources available on the computer. It then provides those resources to other applications that the user wants to execute.

The shell - The shell acts as an interface between the user and the kernel (the core of the operating system). When a user logs in, the login program checks the username and password, and then starts another program called the shell. The shell is a command line interpreter (CLI). It interprets the commands the user types in and arranges for them to be carried out. The commands are themselves programs: when they terminate, the shell gives the user another prompt ($ on our systems).

Filename Completion - If you type part of the name of a command, filename or directory and press the [Tab] key, the shell will complete the rest of the name automatically. If the shell finds more than one name beginning with the letters you have typed, it will pause, prompting you to type a few more letters before pressing the [Tab] key again.

History - The shell keeps a list of the commands you have typed in. If you need to repeat a command, use the cursor keys to scroll up and down the list or type “history” for a list of previous commands.

Files and Processes

Everything in UNIX is either a file or a process.

A process is an executing program identified by a unique process identifier. A file is a collection of data. Files are created by users using text editors, running compilers, etc.

Examples of files:

·  A document (report, essay etc.)

·  The text of a program written in some high-level programming language

·  Instructions that the machine can read directly but that are incomprehensible to a casual user, for example, a collection of binary digits (an executable or binary file)

·  A directory, containing information about its contents, which may be a mixture of other directories (subdirectories) and ordinary files.
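The distinction between processes and files, and between directories and ordinary files, can be checked from a short Python script. This is only a sketch: the file name report.txt is hypothetical, and the script works in a temporary directory that is deleted when it finishes.

```python
import os
import tempfile

# Every running program is a process with its own numeric identifier (PID).
pid = os.getpid()
print("This Python process has PID", pid)

# Make a throwaway directory holding one ordinary file, then ask the
# operating system what kind of file each path is.
with tempfile.TemporaryDirectory() as folder:
    report = os.path.join(folder, "report.txt")  # hypothetical file name
    with open(report, "w") as handle:
        handle.write("A document is an ordinary file.\n")

    folder_is_directory = os.path.isdir(folder)
    report_is_ordinary_file = os.path.isfile(report)
    contents = os.listdir(folder)

print(folder_is_directory, report_is_ordinary_file, contents)
```

The directory reports its contents as a list of names, which matches the description above of a directory as a file that holds information about other files.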

You do not need a Linux operating system to use QIIME. We will be running the Linux environment through the Mac Terminal. So first things first:

Team Application Activity #1: Practicing with LINUX and MacQIIME

Names of Team Members:

Part One: Practicing with the LINUX environment