COMP.3500.202/5800.202HW 5Spring2017

Report due on 3/27/16 (M)

Objective: Design and write an OLC (Overlap Layout Consensus) program for DNA sequencing.

  1. Data

[Undergraduates] Download the first two chromosome sequences of Yeast (S. cerevisiae) (NC_001133.fna and NC_001134.fna) from

(pick Genebank ID or RefSeq ID, and use FASTA format)

[Graduates] Download the first and the lastchromosome (CHR01 and CHR22) sequences of Humanfrom ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens).

Delete all occurrences of N in nucleotides.

Simulate reads by randomly segmenting the genome in chunks of average 400 bp.Each read can be randomly (Gaussian) distributed with the average of 400 bp, but you may use uniform lengths of 400 bp in reads.

You have to generate enough reads so that an arbitrary base position is found in 10 reads, on the average, namely the coverage depth is 10. As in the slides, in order to have the coverage depth of 10, the total number of bases in fragments has to be 10 times the genome size.

Real reads include both 5’ to 3’ and 3’ to 5’ reads. For simplicity, you assume that the simulated reads are only for 5’ to 3’ reads.

2.Write a program(s) to create the overlap graph from reads.An overlap(si, sj) is defined as the length of the longest matches between the suffix of siand the prefix of sj .In order to compute overlaps, you need to perform n*(n-1) comparisons, where n denotes the number of reads.

The following is an example with 7 reads, and the overlap value is in the parentheses.

readoverlap

1. TACCTTG2(3) 4(1) 7(1)

2. TTGAT3(3)

3. GATATGG4(2) 7(1)

4. GGAG3(1) 7(1)

5. CTCTA1(2) 6(3)

6. CTAGT

7. GCTCT2(1) 5(4) 6(2)

3.Each read in part 2 above becomes a node in a graph. Each link (edge) between nodes represents the overlap value. The sequencing problem becomes the traveling salesman problem, visiting every graph node with the largest sum of overlap values. You can try a greedy algorithm. Start with the largest overlap value. Follow the path with largest overlap values.

4.Run the assembly program for only one chromosome. Compare the resulting assembly with the original sequence from NCBI.

5.Run the assembly program for read from two chromosomes, and determine if contigs in two chromosomes are distinguished.

What to submit:

Submit a report including the following:

The data structure that you used to store the simulated read fragments with a discussion on an extension to sequence sizes of mammals.

A discussion on the comparison of your assembled sequence with the original sequence.

Any references, if used.

Your program in an appendix.

Email your report to