Implementation of motif finding algorithm SIMPLE_CONSENSUS
This algorithm finds a position weight matrix (PWM) motif in an input set of DNA sequences.
1. The input is :
- a set of sequences S = {S1,S2,…,SN} in “FastA” format.
- Desired motif length k.
- Number of motifs to report, T.
2. The output is :
- A set of “motifs” M = {m1, m2, … , mT}. Each motif mi is a multi-set of k-mers (k-length substrings), one taken from each input sequence Si.
3. The score of a motif is the information content of the PWM formed by the k-mers in the motif.
4. The algorithm SIMPLE_CONSENSUS works as follows:
Consider any arbitrary ordering {S1,S2,…,SN} of the sequences in set S, e.g., the order in which the sequences were included in the input FastA file.
Iteration 1: For every pair of k-mers x1 (substring of S1)and x2(substring of S2), compute the information content of the corresponding PWM. Store the T highest scoring pairs.
Iteration t : For every k-mer xt+1 (substring of St+1), and for every motif m stored in the previous iteration, add xt+1 to m to obtain a new motif, and compute the information content of the corresponding PWM. Store the T highest scoring new motifs from this iteration.
The algorithm terminates when each Si S has been handled, i.e., after iteration (N-1).
Problem A. Implement a function that creates a PWM out of a given motif, and computes its information content. The formula for information content I(W) of a PWM W is :
where is the entry for base in column k of W. The background probabilities of the bases are denoted by q. Assume a background probability distribution of {A = T = 0.35, C = G = 0.15}. If is 0, assume log to be 0.
Problem B. Implement the algorithm SIMPLE_CONSENSUS, as described above. The input sequences are given in FastA format, described at the end. The inputs k (motif length) and T (number of motifs to report) are provided in the command-line.
Problem C. Run your SIMPLE_CONSENSUS program on each of the three provided FastA files (bicoid.fa, kruppel.fa, hunchback.fa), with k=9.
FastA format of sequences:
A FastaA file can contain one or many sequences. Each sequence begins with a “header” line, whose first character is “>” and the string immediately following this is the name of the sequence. The actual sequence (string over {A,C,G,T}) begins with the next line, and may be split over multiple lines. For example:
>seq1
ACACAGTACAGCTACGATCAGCATCAG
CAGCTACGACGAGCATCAGCGACTACGACTAC
CAGCATCGATCAGC
>seq2
CAGGCGTAGCGATCAGCGAGCGACGAGCACTTCGAGCATCAG
CGATCGAGCAGCTACGACTACGATCAG
CGACTAGCATAGCAC
>seq3
GCAGCGATCAGC