Bioinformatics Research Centre (Birc), Aarhus University, Denmark

Bioinformatics Research Centre (Birc), Aarhus University, Denmark

SOAPindel 2.0.1 User Guide

Shengting Li () March 27, 2012

Bioinformatics Research Centre (BiRC), Aarhus University, Denmark

Beijing Genomics Institute, Shenzhen (BGI-Shenzhen), China

$Id: README 32 2012-04-15 15:10:32Z lishengting $

1)Introduction

SOAPindel 2.0 is focusing on calling indels from the next-generation paired-end sequencing data.

2)Requirements

Input data sources:

1.The reference sequence file used to align the reads. It must be in Fasta format.

2.The files with read-alignments. SOAPindel 2.0 accepts indexed BAM files as input. SOAPindel can guess the library insert sizes by itself, but if users could provide the correct ones, it will save some time.

Sequencing platform:

Theoretically, SOAPindel is designed for all paired-end sequencing data because it doesn’t consider any qualities for now. But we’ve only tested it on Ilumina GA data, so we don't guaranty that it can work on 454 data.

System environment:

1.Linux, Unix or any Linux like system.

2.perl version >= 5.8

3.samtools version >= 0.1.18

3)Workflow

1.Cluster reads around indels

2.Do local assembly and alignment to get indels

3.Filter results

4)Command & Input files

perl indel_detection.ibam.pl <mapping.listreference_seq(FILE|DIR)>

1.mapping.list>: reads mapping files list.

File format:

<path_of_alignment_file_1> Lib_insert_size Deviation_of_insert_size Reads_length

<path_of_alignment_file_2> Lib_insert_size Deviation_of_insert_size Reads_length

.

.

.

path_of_alignment_file>: the reads alignment results with indexed BAM format (could be viewed by 'samtools view').

Lib_insert_size: insert size of library

Deviation_of_insert_size: normally set to SD*2 (~20% of library insert size).

Reads_length: maximal read length in the alignment file

2.reference_seq(FILE|DIR)>: reference sequence file or folder.

User should provide a whole file which contains all chromosomes. The script will split it automatically.

User could also split the file by himself. If so, all reference sequence files must be put in this folder and named like 'chr*.fa' (no quotes, "chr" must be lowercase). Each file contains only one chromosome. Every file should be in fasta format and have id: ">chromosome_name " (chromosome_name must be same with the alignment files).

The command could be run directly, but for the big data set (like one or more chromosomes), users are suggested to use "-p" to print out the script and run it manually (please see the details in Parameters section).

If user want to use the "SOAP" or "un-indexed bam" format rather than "indexed bam" format, please use the "indel_detection.pl" script and see the details in the "README.v1".

5)Parameters

----String & Number Parameters:------

-chr chromosomes (ALL)

This parameter could be a string of chromosomes with regions or not, split by plus "+", e.g. "chr7:26730761-27230760+chr7:89428340-90542763+chr20" means detecting all indels from 26730761 to 27230760 and 89428340 to 90542763 on chr7 and the whole chr20. If you want to detect a lot of regions, you could put them in one file (each line contains a region, the file must have a suffix '.list') and use the file name as the parameter. DEFAULT: ALL, means detect every thing in the alignment files.

-l layer_num (2)

The min valley depth allowed in one cluster. Any thinner layer will be cut and start to find a new cluster. DEFAULT: 2, should be no less than 2. Could be set to 1/20 of sequencing coverage to save some running time for high coverage data.

-k kmer size (25)

Dimensional of de Bruijn graph. DEFAULT: 25, should be no more than 1/2 reads length and no less than 10.

-cpucpu number (8)

Number of local CPUs used. For qsub or other multi-task submitting methods, the total CPU number = this number x job number.

-ext extension length for every cluster (max(read_len,50))

In order to find the indels near the edge of cluster, SOAPindel extends every cluster on each side. DEFAULT: max(reads length, 50), take the longer one.

-x_numread_max_hits# (1)

Don’t consider any reads that hit the reference more than this parameter. DEFAULT: 1, means don’t consider any repeats.

-mm max_mismatch (int(read_len/35)+1)

Max mismatches allowed in one read. DEFAULT: (int(reads length/35)+1).

-il max gap between nails (100)

SOAPindel will stop extending the path if the distance of two adjacent nails is too big. This parameter could set the limitation of indel size softly. DEFAULT: 100

-ol overlap length when cut cluster by max length (il)

If the cluster is too long, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 100

-xl max cluster length (ol*3)

If the cluster is longer than this parameter, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 300

-sdp use bigger deviation (deviation*sdp) to filter insert size of pair aligned reads (2.0)

SOAPindel considers a pair of reads unmapped if the distance between them is bigger than insert_size*(1+deviation*sdp). DEFAULT: 2.0

-wdwork_dir (.)

SOAPindel will create this folder for all internal and final result. DEFAULT: current folder. This value could NOT contains slash '/'.

-st path of samtools

-mc max contigs to align (100)

If the local assembly generates too many contigs, the local alignment only aligns part of them. DEFAULT: 100

-mx unique reads max reliable coverage (auto)

Even if filter reads with SNPs, sometimes there are still too many coverage of reads in some regions (mostly repeats) and could affect the assembly unexpectedly. SOAPindel will drop reads randomly to reduce the coverage to a reasonable number.

-bs block size of assemble_align scripts (100)

Split the assemble_align jobs to spare some disk seeking operation.

-qf path of quality tab file ($BIN_PATH/filter/Indel.qual.tab)

-w window size for successive homogeneous indels checking (30)

In order to filter out false positive sites, SOAPindel limits max number of successive homogeneous indels in a certain size window. DEFUALT: 30

-n max successive homogeneous indels #/window (2)

Max number of successive homogeneous indels allowed. DEFAULT: 2, should be set to higher number if the divergence of the reference and sample is too big, e.g. 7 for Human vs Chimp.

-cm path_of_cross_match, only works when -ucm is set ()

----Switch Parameters:------

-nf don't filter reads with SNPs.

-ngdon't count unique gap-mapped reads as matched reads.

-p only print script, don't run.

Print out the script for user to run it manually.

-nb run the assemble_align scripts one line at a time.

-bfa compress internal fasta output.

-qseq print sample seq for PCR.

SOAPindel doesn’t output sequence of sample around the indels by default because it will generate too huge result. When users need to do PCR validation or similar options, this parameter could help.

-qsub use qsub to submit jobs

If users want to use PBS (qsub) or other batch jobs submission system, the $QSUB_OPT ("-d . -l mem=${MEM}gb,walltime=360:00:00,nodes=1:ppn=${CPU}") variable must be modified manually in order to fit the object system.

-np don't print progress in log

-ucm use cross_match to do local alignment (debug)

-t test time and memoery (debug)

6)Output files

Log

SOAPindel stores running log in the WORK_DIR/log folder. Users can use "cat WORK_DIR/log/*.log to check if there is something wrong during the process.

Result

SOAPindel supports VCF4.0 format in this verison. SOAPindel stores results in the WORK_DIR/result/chr*, one folder for one chromosome. There would be 8 files for each chromosome:

chr*.HEAD

chr*.variation.raw.gz: compressed raw results

chr*.variation.sorted: sorted and unique raw results

chr*.variation.list: all variations (include SNPs) with Q value

chr*.indel.vcf: VCF4.0 format of chr*.variation.list without SNPs

chr*.variation.noq.list: all variations (include SNPs) filtered by internal threshold

chr*.indel.noq.list: all indels filtered by internal threshold

chr*.indel.noq.vcf: VCF4.0 format of chr*.indel.noq.list

All the ".list" files are in same format. Here is the description of each columns:

1.cluster regions id

2.assembledcontigs id

3.chromosome

4.type[-size] (S:SNP; D:Deletion; I:Insertion; N:Heterozygote Position)

5.start position on reference chromosome

6.local start position on the cluster.

7.indel range (local start - local end: indels could happen on any position between the start and end, so the local end - local start + size = tandem repeat length, and tandem repeat length / size = repeat times)

8.indel allele

9.average coverage for insertion; minimum coverage for deletion

10.minimum coverage for insertion; overlapping coverage for deletion

11."+" for homozygote; sample genotype / reference genotype ratio for heterozygote

12.L(left flank length from the assembled contig start to indel start)_R(right flank length from indel end to the contig end)

13.10 bp left flank on reference_((indel range sequence without allele) + 10 bp right flank on reference)

14.Empty by default; if the -qseq is set, this column is (left flank of the contig)_(right flank of the contig) without allele sequence.

15.used internal

16.used internal

17.used internal

And there is one statistics file: WORK_DIR/result/stat.q10.tab