Workshop hands-on session: Basic RNA-Seq data analysis with the Tuxedo suite and ChIP-Seq data analysis with MACS
Note: This is intended as a step-by-step guide to basic RNA-Seq analysis of Illumina reads using the Tuxedo suite (Tophat2/Cufflinks/cummeRbund). Due to time and compute-resource constraints at the workshop, a subset of real data is used for the RNA-Seq analysis, so the results here serve only to demonstrate the workflow. We use only the default parameter settings. Please note that copying and pasting the commands may not work depending on the computer you are using (font/character substitution in Linux); it is best to type the commands yourself. For the Tophat/Cufflinks runs, please get onto a compute node.
For details of parameter setting with Tophat/Cufflinks/cummeRbund, refer to:
Tophat2 manual: http://ccb.jhu.edu/software/tophat/manual.shtml
Cufflinks manual: http://cole-trapnell-lab.github.io/cufflinks/manual/
Bowtie2 manual: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Samtools manual: http://samtools.sourceforge.net/samtools.shtml
CummeRbund manual: http://compbio.mit.edu/cummeRbund/manual_2_0.html
For the wget commands: if you are not on a login node, please change "hpc.oit.uci.edu" in the web links to login-1-2.
1. Get reference genome: We are using yeast as our organism here because of its small genome size. Download the yeast reference genome from the UCSC Genome Browser download site (see class slides) to your computer. An easier alternative is to use the wget command on HPC for a direct download, which eliminates the transfer from your local computer:
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz
This should download into your current directory a compressed archive of the yeast genome sequences, with one FASTA file per chromosome. You can use ls to see it. If you are working with model organisms such as human or mouse, refer to the class slides.
$ gunzip chromFa.tar.gz
$ tar -xvf chromFa.tar
Gunzip and tar should uncompress your downloaded file into 17 FASTA files, one per chromosome. Concatenate the 17 FASTA files into one file to use as your reference genome with the cat command; this way you do not have to index the 17 files individually. *.fa matches all files that end with .fa, and > redirects the output to a file called ref.fa. You can also use your own naming scheme; just keep the names consistent.
$ cat *.fa > ref.fa
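As a quick optional sanity check, you can count the FASTA headers in ref.fa; with all 17 chromosome files concatenated, the command below should print 17:
$ grep -c '>' ref.fa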
2. Index reference genome: Use bowtie2 to index your reference genome ref.fa for alignment. Here -f indicates that your file is in FASTA format, and ref is the prefix of the six output index files.
$ module load bowtie2/2.2.3
$ bowtie2-build -f ref.fa ref
This will take a couple of minutes. For mammalian genomes, it can take hours. After indexing, you should have six index files: ref.1.bt2 through ref.4.bt2, plus ref.rev.1.bt2 and ref.rev.2.bt2. Bowtie2 indexes the genome with a Burrows-Wheeler index to keep its memory footprint small. If you are interested in why you need to index your reference before mapping and what the .bt2 files contain, refer to
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
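Optionally, you can list the index files and ask bowtie2 to summarize what was indexed (sequence names and lengths):
$ ls -l ref*.bt2
$ bowtie2-inspect -s ref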
3. Get annotation: Download the yeast reference genome annotation from the UCSC Genome Browser table view (see class slides). To reduce the run time, a sample annotation file in GTF format was prepared for this tutorial; download it with wget. This is a small subset of the available annotation so that your Tophat run will not take too long.
$ wget http://hpc.oit.uci.edu/biolinux/ref/ref.gtf
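To get a feel for the annotation, you can peek at the first few lines and count how many records of each feature type (column 3 of a GTF file) it contains:
$ head -5 ref.gtf
$ cut -f3 ref.gtf | sort | uniq -c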
4. Get raw reads: Download the sample read files for conditions 1 and 2, which are a small subset of the real files. This will take a few minutes depending on your internet speed. You should be able to use ls and less to see them in your current directory. Notice that it takes a couple of minutes to download just 8-9 million reads.
$ wget http://hpc.oit.uci.edu/biolinux/reads/s1.fastq
$ wget http://hpc.oit.uci.edu/biolinux/reads/s2.fastq
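A FASTQ file stores each read in four lines, so you can count the reads in each file with a little shell arithmetic (optional):
$ echo $(( $(wc -l < s1.fastq) / 4 ))
$ echo $(( $(wc -l < s2.fastq) / 4 ))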
5. Short read alignment: Run Tophat2 with the downloaded reads: s1.fastq for condition 1 and s2.fastq for condition 2. Here we use a single-end read design to save run time. This should run much faster than a normal Tophat run, which could take hours to days. -T forces Tophat to align the reads only to the annotated transcriptome, to save running time. -o specifies the output folder: c1 for condition 1 and c2 for condition 2. -G specifies your annotation file. ref is the prefix of your six bowtie index files. You also need to load samtools and tophat in order to run tophat.
$ module load samtools/0.1.19
$ module load tophat/2.0.12
$ tophat2 -T -G ref.gtf -o c1 ref s1.fastq &
$ tophat2 -T -G ref.gtf -o c2 ref s2.fastq &
The Tophat runs will take quite some time. You do not have to wait for the two runs to finish in order to move on to the next step!
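Once a run does finish, Tophat2 writes a short mapping report into its output folder; you can check on the background jobs and then view the report (the file name below assumes a recent Tophat2 version such as the one loaded here):
$ jobs
$ cat c1/align_summary.txt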
6. Prepare mapping for visualization: Sort and index the two Tophat mappings with samtools so they can be loaded into the Integrative Genomics Viewer (IGV) for visualization. First, instead of waiting for your Tophat runs to finish, get the prepared BAM files using wget. These two files should be the same as your Tophat run output: c1/accepted_hits.bam and c2/accepted_hits.bam. You need to sort and index your BAM files so IGV can quickly find the desired genomic location for visualization.
$ wget http://hpc.oit.uci.edu/biolinux/tophat/s1.bam
$ wget http://hpc.oit.uci.edu/biolinux/tophat/s2.bam
$ samtools sort s1.bam s1.sorted
$ samtools sort s2.bam s2.sorted
$ samtools index s1.sorted.bam
$ samtools index s2.sorted.bam
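If you want a quick summary of each sorted mapping (total and mapped read counts), samtools flagstat will print one:
$ samtools flagstat s1.sorted.bam
$ samtools flagstat s2.sorted.bam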
7. Visualization: Visualize your indexed and sorted mapping in Integrative Genomics Viewer.
$ module load igv
$ igv &
Click File -> Load from File and choose s1.sorted.bam. Use the yeast genome sacCer3 (the build you downloaded in step 1) as the reference genome, go to chrIII:167,540-167,956 (you can type it in the search box), and you should see your mapped reads! Load s2.sorted.bam in the same way; s1 and s2 will each have a data track in your window. Mouse over and right-click to change the color scheme, etc. Change the coordinates to wherever you like.
8. Expression quantification: Run cufflinks on the sorted BAM files from your Tophat mappings to assemble transcripts and estimate abundances for each sample. -o specifies the cufflinks output folder: cf1 for condition 1 and cf2 for condition 2:
$ module load cufflinks
$ cufflinks -o cf1 s1.sorted.bam &
$ cufflinks -o cf2 s2.sorted.bam &
The & sends the job to the background. You should have two cufflinks output folders, cf1 and cf2, when the runs are done. Use
$ ls -ltr cf1
to see what is inside.
The file transcripts.gtf contains your assembled transcripts. genes.fpkm_tracking and isoforms.fpkm_tracking list the FPKM values with their corresponding gene/transcript IDs and genomic locations.
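These are tab-delimited files; viewing them with less -S (which keeps each record on a single line) makes the columns easier to read:
$ less -S cf1/genes.fpkm_tracking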
9. Consolidate assembled transcriptome: Use cuffmerge to combine your assembled transcripts from all samples. The file manifest.txt contains the paths to the two assembled transcripts.gtf files. Use the less command to take a look.
$ wget http://hpc.oit.uci.edu/biolinux/manifest.txt
$ less manifest.txt
$ cuffmerge -s ref.fa manifest.txt
When the run is finished, you should have a folder generated by cuffmerge called merged_asm. The merged.gtf file in this folder is your merged transcriptome in GTF format.
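If you prefer to build the manifest yourself instead of downloading it, it is just a text file with one transcripts.gtf path per line. A sketch using the cufflinks output folders from step 8 (the downloaded manifest.txt may use different paths; my_manifest.txt is just an example name):
$ printf 'cf1/transcripts.gtf\ncf2/transcripts.gtf\n' > my_manifest.txt
$ cuffmerge -s ref.fa my_manifest.txt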
10. Differential expression analysis: Use cuffdiff on the two sorted Tophat mapping files to do differential expression analysis between conditions 1 and 2. -o specifies the cuffdiff output directory. -L labels the two conditions c1 and c2. --FDR sets the false discovery rate threshold to 0.05. merged_asm/merged.gtf is the merged transcriptome annotation you generated in step 9. Note that this test data set does not have biological replicates; in reality you will usually have several replicates per condition, in which case you list the replicate BAM files, comma-separated, for each condition and include their assembled transcripts in the cuffmerge manifest.
$ cuffdiff -o cd_1_2 -L c1,c2 --FDR=0.05 merged_asm/merged.gtf s1.sorted.bam s2.sorted.bam
When cuffdiff is done, you should have a folder called cd_1_2 with your cuffdiff results.
$ ls -ltr cd_1_2
This will show you the list of results. There are around 22 files in the folder; you will mostly be interested in the .diff files, such as gene_exp.diff. These are tab-delimited text files and can be imported directly into Excel.
$ less cd_1_2/gene_exp.diff
For details of cuffdiff output, see http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html#cuffdiff-output-files
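The last column of gene_exp.diff flags whether each gene is significant at your FDR threshold; a quick way to count and preview the significant genes (assuming the standard cuffdiff column layout):
$ awk -F'\t' '$NF == "yes"' cd_1_2/gene_exp.diff | wc -l
$ awk -F'\t' '$NF == "yes"' cd_1_2/gene_exp.diff | head -5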
------
For those of you with basic R knowledge or who want more of a challenge, use the cuffdiff output you generated earlier (in the cd_1_2 folder) and follow the protocol here. Line-by-line instructions are included to show you how to generate graphs comparing the two samples. For details, see
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
You need to load R first. You will also need x2go for graphics if you are using a Windows laptop. Make sure you are in the directory that contains "cd_1_2"; change directory using cd.
$ module load R/2.15.2
$ R
library(cummeRbund)
cuff_data <- readCufflinks('cd_1_2')
Depending on the configuration on HPC, you might need to use the absolute path in the above command.
csDensity(genes(cuff_data))
csScatter(genes(cuff_data), 'c1', 'c2')
csVolcano(genes(cuff_data), 'c1', 'c2')
Quit R using quit().
For those of you who are interested in ChIP-Seq data analysis and have extra time in the tutorial session, here is some sample data published by UCI researchers (SREBP-1 ChIP-Seq data):
https://cbcl.ics.uci.edu/doku.php/data
Both the raw reads and the processed data are available on this website. The paper that describes the dataset is here if you are interested:
http://www.pnas.org/content/106/33/13765.long
Download the raw data from the first link using wget on HPC (this will take a few minutes)
$ wget https://cbcl.ics.uci.edu//public_data/SREBP1/raw_data/IgG_lane8_eland_multi.txt
$ wget https://cbcl.ics.uci.edu//public_data/SREBP1/raw_data/srebp1_lane7_eland_multi.txt
Now run MACS on HPC. -g mm tells MACS to use the mouse effective genome size. -t specifies the treatment file, srebp1_lane7_eland_multi.txt, and -c specifies the control file, IgG_lane8_eland_multi.txt. Then use the perl one-liner to remove the "ref_" prefix from the chromosome names so they are compatible with the IGV genome names.
$ module load enthought_python/7.3.2
$ module load macs
$ macs -t srebp1_lane7_eland_multi.txt -c IgG_lane8_eland_multi.txt -g mm
$ perl -anle 's/ref_//g; print $_' NA_peaks.bed > NA_peaks1.bed
Your results are in the current folder; use the ls and less commands to check them out! The BED file NA_peaks1.bed can be imported directly into IGV for visualization. Choose the mouse genome mm8 build. Check it out yourself!
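To see how many peaks MACS called and what the BED records look like:
$ wc -l NA_peaks1.bed
$ head -5 NA_peaks1.bed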
For those of you with extra time who are up for a challenge, try writing a shell script that includes the commands from sections 1 to 10 and submit it to the HPC Grid Engine with the qsub command from the Linux session. Does your job run? This is your first RNA-Seq pipeline! A minimal sketch is given below.
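Here is a minimal sketch of such a script, assuming the files downloaded in steps 1, 3, and 4 (ref.fa, ref.gtf, s1.fastq, s2.fastq, manifest.txt) are already in the submission directory and that manifest.txt lists your cf1/cf2 transcripts.gtf paths. The queue name is only a placeholder, so replace it with one you have access to, and note that the trailing & from the interactive session is dropped so the steps run in order. Save it as, for example, rnaseq_pipeline.sh:

#!/bin/bash
#$ -N rnaseq_demo        # job name
#$ -cwd                  # run from the directory you submit from
#$ -q your_queue         # placeholder: replace with a queue you can use

# Same modules as the interactive session
module load bowtie2/2.2.3
module load samtools/0.1.19
module load tophat/2.0.12
module load cufflinks

# Index the reference (step 2)
bowtie2-build -f ref.fa ref

# Align reads (step 5), then sort and index the mappings (step 6)
tophat2 -T -G ref.gtf -o c1 ref s1.fastq
tophat2 -T -G ref.gtf -o c2 ref s2.fastq
samtools sort c1/accepted_hits.bam s1.sorted
samtools sort c2/accepted_hits.bam s2.sorted
samtools index s1.sorted.bam
samtools index s2.sorted.bam

# Assemble and quantify (step 8), merge (step 9), and test for differential expression (step 10)
cufflinks -o cf1 s1.sorted.bam
cufflinks -o cf2 s2.sorted.bam
cuffmerge -s ref.fa manifest.txt
cuffdiff -o cd_1_2 -L c1,c2 --FDR=0.05 merged_asm/merged.gtf s1.sorted.bam s2.sorted.bam

Then submit it:
$ qsub rnaseq_pipeline.sh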