Computational Biology and Genomics Workshop

Todos Santos Center

April 18-22, 2016

Identifying sequence variants with GATK

Background. Identifying DNA sequence polymorphisms within and among individuals is becoming one of the most common applications of next-generation sequencing technologies. Although conceptually simple, these analyses present numerous technical challenges associated with distinguishing real biological variants from sequence artifacts. The Genome Analysis Toolkit (GATK) produced by the Broad Institute is one of the most flexible and widely used tools for identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). These tools can be used with diverse types of data, including genomic DNA sequences, RNA-seq, pooled samples, etc.

Objectives. The goal of this exercise is to take a sample of reads corresponding to a single portion of chromosome 10 in the human genome and identify SNPs relative to a human genome reference sequence. It will include multiple steps from the GATK tutorials to illustrate some of the technical considerations involved in mapping reads to a reference genome and identifying sequence variants.

Software and Dependencies

  • GATK
  • Picard Tools
  • Burrows-Wheeler Aligner (bwa)
  • Integrated Genomics Viewer (IGV)

The bwa executables should be installed and in your PATH. GATK and Picard Tools are java-based jar files and, therefore, cannot be added to your PATH as executables. The code below assumes that the paths to the GenomeAnalysisTK.jar and picard.jar files have been saved in the environment variables $GATK and $PICARD.
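
If these variables are not already defined in your session, you can set them along the following lines (the paths below are placeholders; substitute the actual locations of the jar files on your machine):

export PICARD=/path/to/picard.jar
export GATK=/path/to/GenomeAnalysisTK.jar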

Protocol

  1. Screen Illumina reads for adapter sequences. Our starting file for this exercise is unmapped.bam. This file contains raw Illumina sequence reads from genomic DNA of a single human individual, but they have been filtered to include only those that map to a small snippet of chromosome 10. Note that this file is not in the familiar FASTQ format. It is a BAM file. The SAM/BAM format is widely used in analyzing sequence variants, and we typically think of these files as summarizing reads that have been mapped to a reference sequence (more on that later). But they can also summarize raw unmapped reads, as in this case.

Open a Terminal session and enter the following command:

cd ~/TodosSantos/gatk

Now, let’s start by using a feature from Picard Tools to screen our reads for adapter sequences, which are artifacts of the library construction process. If not trimmed out, these sequences can be mistaken for real sequence variants. The following command will compare reads to a default library of Illumina adapter sequences, but custom sequences that were part of your specific library protocol can also be specified. Enter the following command (all on one line).

java -jar $PICARD MarkIlluminaAdapters I=unmapped.bam O=unmapped.markedadapters.bam M=markilluminaadapters_metrics.txt

Remember $PICARD is just an environment variable that has been set to refer to the location of the picard.jar file from Picard Tools. MarkIlluminaAdapters is the specific tool within Picard Tools that is being called. The I, O, and M options specify the input BAM file, the modified output BAM file, and the metrics file, respectively.
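
If you would like to see how many reads contained adapter sequence, the metrics file written by the M option is plain text and can be inspected directly (optional; not required for the rest of the pipeline):

less markilluminaadapters_metrics.txt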

  2. Convert the marked BAM file back to a FASTQ file. Enter the following command to convert the new BAM file into a FASTQ file that can be used for mapping to a reference genome.

java -jar $PICARD SamToFastq I=unmapped.markedadapters.bam FASTQ=markedadapters.fq CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true

The output here is a single FASTQ file (markedadapters.fq) with the read 1 and read 2 sequences interleaved (i.e., “shuffled”) and with reads containing adapter sequence marked for clipping.
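
As a quick, optional sanity check, you can look at the first read pair in the interleaved file; read 1 and read 2 of the pair should appear back to back (each read occupies four FASTQ lines, so one pair is eight lines):

head -n 8 markedadapters.fq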

  3. Map reads to the human reference genome. For this step we are going to use a human reference genome (human_g1k_v37_decoy.fasta) found in the human_bwa_ref directory. Note that this FASTA file has already been used to create index and dictionary files for mapping and downstream analyses (using the “bwa index” function, the Picard Tools “CreateSequenceDictionary” function, and the “samtools faidx” function).
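
For reference, if you were starting from your own FASTA file, those indexing and dictionary steps would look roughly like the following (they have already been run for this exercise, so there is no need to repeat them here; the last command assumes samtools is installed):

bwa index human_bwa_ref/human_g1k_v37_decoy.fasta

java -jar $PICARD CreateSequenceDictionary R=human_bwa_ref/human_g1k_v37_decoy.fasta O=human_bwa_ref/human_g1k_v37_decoy.dict

samtools faidx human_bwa_ref/human_g1k_v37_decoy.fasta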

Map the reads in the FASTQ file generated above with the following command (all on one line):

bwa mem -M -t 7 -p human_bwa_ref/human_g1k_v37_decoy.fasta markedadapters.fq > mapped.sam

The resulting output file (mapped.sam) contains a line-by-line summary of the mapping outcome for every read.
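
Because SAM is a plain-text format, you can glance at a few of those alignment records if you like (lines beginning with @ are header lines and are skipped here):

grep -v "^@" mapped.sam | head -n 5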

  4. Conserve original read data by merging the mapping output with the original unmapped read file. This step will combine data from the original BAM file with the results of the mapping output, so all of the information will be available for downstream analyses. Enter the following command (all on one line).

java -jar $PICARD MergeBamAlignment R=human_bwa_ref/human_g1k_v37_decoy.fasta UNMAPPED_BAM=unmapped.bam ALIGNED_BAM=mapped.sam O=merged.bam CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS

Understanding the long list of command line options is not particularly important for the purposes of this exercise, but consult the GATK and Picard Tools documentation for detailed information on this and other commands.

  5. Mark duplicate reads in the alignment. The task of identifying sequencing artifacts can be more challenging if there are PCR duplicates or other duplicate reads in your data set, because an artifact in one read can be propagated to multiple reads and appear to be a true variant within the sample. This step will identify read pairs with identical start positions as duplicates, so they can be excluded from downstream analyses. Enter the following command (all on one line).

java -jar $PICARD MarkDuplicates I=merged.bam O=markedDuplicates.bam METRICS_FILE=markedDuplicates_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 CREATE_INDEX=true
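
The metrics file produced here reports, among other things, the fraction of reads flagged as duplicates (look for the PERCENT_DUPLICATION column). It is plain text and can be viewed with:

less markedDuplicates_metrics.txt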

  6. Identify (raw) variants using the GATK HaplotypeCaller. The heart of GATK pipelines is variant calling. The HaplotypeCaller tool will identify variants and report them in VCF format. Enter the following command (all on one line).

java -jar $GATK -T HaplotypeCaller -R human_bwa_ref/human_g1k_v37_decoy.fasta -I markedDuplicates.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants.vcf

Remember that $GATK is just an environment variable that has been set to refer to the location of the GenomeAnalysisTK.jar file. This step will take approximately 12 minutes. While it is running, go on to the IGV step below.

  7. Visualize the mapping results with IGV. Open the genome viewer software IGV from the /Applications directory.

In the top left corner of the window, use the drop-down menu to select “Human (1kg, b37 + decoy)”.

Then select File > Load from File…

And select ~/TodosSantos/gatk/markedDuplicates.bam

In the text box at the center of the top panel, enter the following coordinates to navigate to the part of chromosome 10 that we are analyzing in this exercise: 10:96,867,400-96,869,400

Zoom in and out and use the viewer to explore the mapping results, including color-coded mismatches to the reference.

  8. Extract SNPs from the VCF file and apply quality filters. Once finished, the HaplotypeCaller analysis (step 6) will have produced a file (raw_variants.vcf) that summarizes both SNPs and small indels. The following three commands will 1) isolate the SNPs into a separate VCF file, 2) exclude low-confidence SNPs, and 3) summarize the resulting set of SNP calls relative to a database of known human SNPs (dbSNP).

Extract raw SNPs:

java -jar $GATK -T SelectVariants -R human_bwa_ref/human_g1k_v37_decoy.fasta -V raw_variants.vcf -selectType SNP -o raw_snps.vcf

Filter based on SNP quality:

java -jar $GATK -T VariantFiltration -R human_bwa_ref/human_g1k_v37_decoy.fasta -V raw_snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o filtered_snps.vcf

Summarize filtered SNPs:

java -jar $PICARD CollectVariantCallingMetrics I=filtered_snps.vcf O=SNP_Metrics DBSNP=human_bwa_ref/dbsnp_138.b37.excluding_sites_after_129.vcf

To view the resulting summary:

less SNP_Metrics.variant_calling_summary_metrics
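
As a rough cross-check on that summary, you can also count passing SNP calls directly from the VCF. VariantFiltration marks failing sites rather than removing them, so high-quality calls are those with PASS in the FILTER column (column 7):

grep -v "^#" filtered_snps.vcf | awk '$7 == "PASS"' | wc -l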

  • How many high quality SNPs were identified?
  • How many are novel (i.e., not found in dbSNP)?
  • How would you test whether some of those SNPs are false positives and identify those false positives?
  • How would you identify true SNPs that were improperly filtered out?

The full documentation and tutorials for GATK are available here:
