Implementation of motif finding algorithm SIMPLE_CONSENSUS

This algorithm finds a position weight matrix (PWM) motif in an input set of DNA sequences.

1. The input is :

  • a set of sequences S = {S1,S2,…,SN} in “FastA” format.
  • Desired motif length k.
  • Number of motifs to report, T.

2. The output is :

  • A set of “motifs” M = {m1, m2, … , mT}. Each motif mi is a multi-set of k-mers (k-length substrings), one taken from each input sequence Si.

3. The score of a motif is the information content of the PWM formed by the k-mers in the motif.

4. The algorithm SIMPLE_CONSENSUS works as follows:

Consider any arbitrary ordering {S1,S2,…,SN} of the sequences in set S, e.g., the order in which the sequences were included in the input FastA file.

Iteration 1: For every pair of k-mers x1 (substring of S1)and x2(substring of S2), compute the information content of the corresponding PWM. Store the T highest scoring pairs.

Iteration t : For every k-mer xt+1 (substring of St+1), and for every motif m stored in the previous iteration, add xt+1 to m to obtain a new motif, and compute the information content of the corresponding PWM. Store the T highest scoring new motifs from this iteration.

The algorithm terminates when each Si  S has been handled, i.e., after iteration (N-1).

Problem A. Implement a function that creates a PWM out of a given motif, and computes its information content. The formula for information content I(W) of a PWM W is :


where is the entry for base  in column k of W. The background probabilities of the bases are denoted by q. Assume a background probability distribution of {A = T = 0.35, C = G = 0.15}. If is 0, assume log to be 0.

Problem B. Implement the algorithm SIMPLE_CONSENSUS, as described above. The input sequences are given in FastA format, described at the end. The inputs k (motif length) and T (number of motifs to report) are provided in the command-line.

Problem C. Run your SIMPLE_CONSENSUS program on each of the three provided FastA files (bicoid.fa, kruppel.fa, hunchback.fa), with k=9.

FastA format of sequences:

A FastaA file can contain one or many sequences. Each sequence begins with a “header” line, whose first character is “>” and the string immediately following this is the name of the sequence. The actual sequence (string over {A,C,G,T}) begins with the next line, and may be split over multiple lines. For example:

>seq1

ACACAGTACAGCTACGATCAGCATCAG

CAGCTACGACGAGCATCAGCGACTACGACTAC

CAGCATCGATCAGC

>seq2

CAGGCGTAGCGATCAGCGAGCGACGAGCACTTCGAGCATCAG

CGATCGAGCAGCTACGACTACGATCAG

CGACTAGCATAGCAC

>seq3

GCAGCGATCAGC