Realistic Gap Models

No-gap Alignment

So far, we have treated the gap symbol “-” as yet another character, denoting an individual insertion or deletion. However, this view is not always adequate. Sometimes we want no-gap alignments. For example, in a family of proteins there may be a strongly conserved subunit which is the site of some protein-protein interaction. Any deletion/insertion in the chain of amino acids would be likely to destroy its biochemical function. Such regions we want to align using matches/replacements only. This of course can be achieved by a very simple algorithm. But also our dynamic programming algorithm can be geared to do this by setting costs for insertion and deletion to infinity (or something close to it). Hence, an optimal alignment will not use gaps.

Block-indel

Sometimes, from an evolutionary point of view, it is more realistic to assume that nature inserts or deletes entire substrings as a unit. This is called the block-indel-model. It means that we charge a certain set-up cost for introducing a new gap, whereas extending an existing gap is less expensive. For example, a linear gap cost function is of the form, whereis the length of the gap (number of consecutive “-”), and . Our dynamic programming scheme can be adapted to this model without much effect on its efficiency.

Variations of Pairwise Alignment

Local Alignment

The notion of edit distance and its implementation via dynamic programming are easily adapted to variations of the original problem. Two such variations are discussed here.

We first discuss the problem of local alignment, where s is relatively short with respect to t and we seek that subunit of t which s aligns best with. More precisely: given and, find such that is minimal among all choices of . Very sophisticated methods have been developed to do this. But a simple change to the dynamic programming scheme will also do the job, albeit a little slower.

(Local Alignment Recursion) Local alignment means that we do not charge costs for

1.  deletion of a prefix . (which means that , and hence we initialize the first row of the distance matrix accordingly: .)

2.  deletion of a suffix . (which means that

So the last line of the cost matrix is calculated according to

As before, gives the cost of the optimal local alignment. The matching subsequence is then found as follows:

(Local Similarity) For the second variation of our similarity theme, consider the following case. Let s and t be two proteins, which we suspect to carry some functionally related subunits. However, most parts of s and t do not contribute to this function, and may well be very different. A pairwise alignment of s and t would be unlikely to clearly exhibit small regions of high similarity, as it is geared to minimize distance over the full length of both sequences. Thus we are faced with another problem, this one being symmetric between s and t. We ask for those subunits of s and t that exhibit most similarity. This is called the local similarity problem:

This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use the score 0 as a cut-off value between subsequences with/without similarity. We are now maximizing similarity rather than minimizing distance. The border cases now become and the general recursion formula is

The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the matrix, from which stretches of local similarity rise as islands of positive values.

Heuristic Methods

(Edit Dist. Calc. Complexity) Let us now consider the computational resources needed for calculating edit distances: The dynamic programming scheme calculates (for input sequences and ) matrix entries. Each matrix element can be calculated from the three adjacent elements with a fixed number of operations. Then the execution time of this program is proportional to . In the jargon of computer science, we say that its time complexity is .