Chapter 3
String Matching
To search some designated strings in computers is a classical andimportant problem in computer science. We can search or replacesome texts in word processors, find some interesting websites fromsearching engines (like Yahoo and Google) andsearch for papers from digital libraries.The main problem of those operations is string matching.
Many databases (like GenBank) were built to support DNA and protein sequencescontributed by researchers in the world.Since DNA and protein sequences can be regarded as character strings composed of4 and 20 different characters respectively, the string matching problem isthe core problem for searching these databases.To design efficient string matching algorithms will become a more and more importantresearch area as the size of databases grows.
In this chapter, we shall introduce two efficientalgorithms for the exact string matching problem. Moreover, we shall introduce twoimportant data structures, the suffix tree and array, for string matching problems.Finally, we introduce an algorithm for the approximate string matching problem.
3.1 Basic Terminologies of Strings
A string is a sequence of characters from a alphabet setΣ.For example, "AGCTTGA" is a string fromΣ ={A,C,G,T}.The length of a string S, denoted by |S|, is the number of characters of S.For example, the length of "AGCTTGA" is 7.Let be the ith character in the left-to-right order forstring S, where i>= 1.Let denote the string …, where i <=j.String S' is called a substring of string S if S'= forsome i, where 1 <=i <= |S|-|S'|+1.String S' is called a subsequence of string S if there existsan increasing sequence … such that= for 1 <=i<= |S'|.For example, "GCTT" and "TTG" are substrings and subsequences of "AGCTTGATT",while "ACT" and "GTT" are subsequences of "AGCTTGATT".String S' is called a prefix of string S if S'=.String S' is called a suffix of string S if S'=.For example, "AGCTT" and "AG" are both prefixes of "AGCTTGATT",while "TT" and "GATT" are both suffixes of "AGCTTGATT".
Given a text string T of length n and a pattern string P oflength m, the exact string matching problem is to find all theoccurrences of P in T.It is easy to design a brute-force algorithm for this problem, which enumerates all n– m+1possible substrings of length m in T by shifting a sliding window oflength m and determines which substrings are P.The algorithm is given below:
Algorithm 3.1 A brute-force algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of exact matchings of P in T.
fori = 1 ton – m+1 do
if ( = ) then
Report that pattern P appears at position i;
endif
endfor
Since it takes O(m) time to determinewhether two strings of length m are the same, the algorithm conductsin O((n – m+1)m)=O(nm) time.
3.2 The KMP Algorithm
In this section, we shall introduce the KMP algorithm, proposed byKnuth, Morris and Pratt in 1977 to speed up the exact matching procedure.We will use several cases to illustrate their idea.
Case 1: In this case, the first symbol of P dose notappear again in P. Consider Figure 3.1(a).The first mismatch occurs at and . Since to exactly match with to and the first symbol of P dosenot appear any where in P, we can slide the window all the wayto and match with , as shown in Figure 3.1(b).
Case 2: In this case, the first symbol of P appears again inP. Consider Figure 3.2(a). The first mismatchoccurs at and . Since the first symbol of P doseoccur in P again, we can not slide the window all the wayto match with . Instead, we must slide the window to match with because ="A"=. This is shown in Figure 3.2(b).
Case 3: In this case, not only the first symbol of P appears inP, prefixes of P appear in P twice.Consider Figure3.3(a).First, note that the string is equal to theprefix and the string is equal to the prefix .In Figure 3.3(a), the first mismatchoccurs at and .If there is a good preprocessing, we will go back to andfind out that is equal to the prefix .Besides, since mismatch occurs after , we know that matches with . Therefore, we can slide the windowto align with and start matching with ,as shown in Figure 3.3(b).
The basic principle of the KMP algorithm is illustrated inFigure 3.4.In general, given a pattern P, we should first conduct apreprocessing to find out all of the places where prefixes appear.This can be done by computing a prefix function.
For 1 <=j<=m, let the prefix function f(j) for bethe largest kj such that = ; 0 if there is no such k.The prefix function is illustrated in Figure 3.5.Note that f(1) = 0 for any cases.For example, the prefix function f for P="ATCACATCATCA" isshown in Figure 3.6.
(a)
(b)
Figure 3.1: The First Case for the KMP Algorithm
(a)
(b)
Figure 3.2: The Second Case for the KMP Algorithm
(a)
(b)
Figure 3.3: The Third Case for the KMP Algorithm
Figure 3.4: The KMP Algorithm
Figure 3.5: The Prefix Function
Figure 3.6: An Example of the Prefix Function
If the prefix function is computed first andstored in an array, the window shifting will be determined easily by looking up thearray. Consider the pattern P="ATCG".If the mismatch occurs when comparing with ,then we can continue comparing with=, where f(j)=0, 1 <=j <= 4, for P="ATCG".Consider the pattern P="ATCACATCATCA" shown in Figure 3.6.If the mismatch occurswhen comparing with , then we can continue comparing with==.
In general, if a mismatch occurs when is compared with ,then we align with if j≠1 andalign with if j=1.
In the following, we shall discuss how to find this prefixfunction. Consider the prefix function of Figure 3.6.
(1) Suppose that f(4) has been found to be 1. Let us see whatthis means. This means that =. Based upon this information, wecan determine f(5) easily. If =, then we setf(5)=f(4)+1; otherwise, we set f(5)=0. In this case, since≠, we set f(5)=0.
(2) Suppose we have found f(8)=3. This means that=. We now see whether =. Since =,we set f(9)=f(8)+1=4.
(3) Suppose we have found that f(9)=4. We now check whether=. The answer is no. Does this mean that we should setf(10) to be 0? No. Let us note that f(9)=4 and f(4)=1. Thismeans that ==. In other words, we have a pointerpointing to now. We check whether =. isindeed equal to . We therefore set =f(1)+1=0+1=1.
Let = for x > 1 and = f(y).The prefix function f(j), 2 <=j<=m,for can be rewritten as follows:
It can be shown that this new formula is equivalent to the old one.Consider the pattern P in Figure 3.6 again.Note that f(1) is 0 for any cases. f(2) and f(3) are 0by the definition. We set f(4) = 1 because = = = "A". However, we let f(5) be 0 because "C" = ≠ = = "T" and "C" = ≠ = = "A".In addition, we have f(6) = 1, f(7)=2, f(8)=3 and f(9)=4 because = = = "A", = = = "T", = = = "C" and = = = "A", respectively.Since "T" = ≠ = = "C" and = = = "T", we have f(10)=2.Finally, we set f(11) = 3 and f(12)=4 because = = = "C" and = = = "A", respectively.
The complete algorithm for computing f is given below:
Algorithm 3.2An algorithm for computing the prefix function f
Input: A string P of length m.
Output: The prefix function f for P.
f(1) = 0;
forj = 2 tomdo
t = f(j-1);
while (≠ andt≠ 0) do
t = f(t);
endwhile
if (t = 0) then
f(j)= 0;
else
f(j)= t+1;
endif
endfor
The KMP algorithm consists of two phases. It computes the prefix functionfor the pattern P in the first phase, and searches the patternin the second phase.Refer to Algorithm 3.3 for the details.
Let us see an example for the KMP algorithm as shown inFigure 3.7. Initially, = is aligned with= (see Figure 3.7(a)). Next, the algorithmcontinues comparing and because ≠ (see Figure 3.7(b)).A mismatch occurs at and .Then it compares with ==(see Figure 3.7(c).After that, since ≠, the algorithmcontinues comparing with (see Figure 3.7(d)).Fortunately, we match and successfully and reportthat pattern P appears at position i– m+1=17 – 12+1=6 of T.The next pairwise-comparing starts at and ==(see Figure 3.7(e)).Since a mismatch occurs at and ,the algorithm continues comparing and =(see Figure 3.7(f)).The remaining comparisons are left for exercises.
It can be proved that the algorithm runs in O(n+m) time; O(m) time forcomputing function f and O(n) time for searching P.The analyses of the algorithm are quite difficult and need theknowledge of amortized analysis, which will be omitted here.
Algorithm 3.3The KMP Algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
/* Phase 1 */
Compute f(j) for 1 <= j <= m;
/* Phase 2 */
i=1, j=1;
while (i <= n) do
if ( = andj = m) then
Report that pattern P appears at position i – m + 1;
j = f(m) + 1;
i = i + 1;
endif
if ( = andj≠m) then
j = j + 1;
i = i + 1;
endif
if ( = andj≠1) then
j = f(j - 1) + 1;
endif
if ( = andj= 1) then
i = i + 1;
endif
endwhile
3.3 The Boyer-Moore Algorithm
The Boyer-Moore algorithm was proposedby Boyer and Moore in1977.The worst time complexity of this algorithm is no better than that of KMP algorithm,but it is more efficient in practice than the KMP algorithm.
This algorithm compares the pattern with the substring within a sliding windowin the right-to-left order. In addition, bad character andgood suffix rules are used to determine the movement of sliding window.
Suppose that is aligned to now, and we performpairwise-comparing from right to left. Assume that the firstmismatch occurs when comparing with ; that is, = for 0 <=k <=m – j-1, and ≠ as shown in Figure 3.8. We have the following possible ways to shift the sliding window.
Figure 3.7: An Example for the KMP Algorithm
Since ≠ , any exact matching which moves the window to the right mustmatch some character to the left of in P exactly with . For example, consider Figure 3.9(a). In this case, m=12, s=6 and j=10,="G"≠ .We scan from tothe left and find ="G"=. Therefore, we move the window to the right to alignwith , as shown in Figure 3.9(b).The basic idea of this rule is illustrated in Figure 3.9(c)and we have the following rule:
Bad Character Rule:Align with , where j' is the rightmost position of in P.
Figure 3.8: = for 0 <= k <= m – j -1, and ≠
(a)
(b)
(c)
Figure 3.9: Bad Character Rule
In the above bad character rule, only one character is used. We can actually do better than that by using the so called good suffix rule. Let us consider Figure 3.10(a). Inthis case, m=12, s=6 and j=10. We note that =="CA" and="G"≠="T". We therefore should find the rightmost substring in the left of inPwhich is equal to "CA" and the left character to this substring is not equal to ="T",if it exists. In our case, we find that ="CA" and ≠="T".Therefore wemove the window right to align with , as shown inFigure 3.10(b). The basic ideaof this rule is now illustrated in Figure 3.10(c)and we have the following rule:
Good Suffix Rule 1:Align with , where j' (m – j+1 <=j'm) is thelargest position such that is a suffix of (i.e. =) and ≠ (see Figure 3.11).
(a)
(b)
(c)
Figure 3.10: Good Suffix Rule 1
Figure 3.11: The Movement for Good Suffix Rule 1
(a)
(b)
(c)
Figure 3.12: Good Suffix Rule 2
In the following, we will further make use of a matching. Consider Figure 3.12(a).In this case, m=12, s=6 and j=8. In the previous good suffix rule, we will try tomove the window so that another substring of P would match exactly with the="AATC". This is not possible because there is no other substring in P whichis equal to "AATC". Yet there is a prefix, namely , which is equal to "ATC"which is a suffix of "AATC".We therefore move the windowso that matches with .The basic ideaof this rule is now illustrated in Figure 3.12(c)and we have the following rule:
Good Suffix Rule 2:Align with , where j' (1 <=j'<=m-j) isthe largest position such that is a suffix of (i.e. =)(see Figure 3.13).
Let B(a) be the rightmost position of aΣin P.Figure 3.14(a) shows the values of B forΣ ={A,T,C,G} and P="ATCACATCATCA".This function will be used for applying bad character rule.The algorithm for computing B is described in Algorithm 3.4.
Figure 3.13: The Movement for Good Suffix Rule 2
(a)
(b)
Figure 3.14: Two Functions for the Good Suffix Rule: (a)Function B (b)Function G
Algorithm 3.4An algorithm for Computing Function B
Input: A pattern P of length m.
Output: Function B for P.
forj=1 tomdo
B(j) = 0;
endfor
forj=1 tomdo
=j;
endfor
As for the good suffix rule, we need to scan the pattern for storing the information aboutthe maximum number of shifts in each position of P.Let be the largest k suchthat is a suffix of and ≠ , where m – j+1 <=km; 0 if there is no such k.Let be the largest k such that is a suffix of , where 1 <=k<=m – j; 0 if there is no such k.Functions and are illustrated inFigure 3.15(a) and (b) respectively.Consider and for P="ATCACATCATCA", which is shown in Figure 3.14(b).We set =9 because ="CATCA"=, which is a suffix of , and ≠ ,while we set =4 because ="ATCA"=, which is a suffix of .However, we set =0 by definition,while we set =4 because ="ATCA"=.
(a)
(b)
Figure 3.15: Functions and
Let us consider . Note that =9. This means that="CATCA" must be equal to and ≠ .Suppose a mismatch occurs at , as shown in Figure 3.16(a).We can move the window m-=12 – 9=3 positions, as illustratedin Figure 3.16(b).
Consider . Note that =4. This means that is a suffix of . That is, must be equal to .If a mismatch occurs at , as shown in Figure 3.17(a),we can move the window m-=12 – 4=8 positions, as illustratedin Figure 3.17(b).
(a)
(b)
Figure 3.16: Shifting for the Good Suffix Rule 1
(a)
(b)
Figure 3.17: Shifting for the Good Suffix Rule 2
Function G(j) is defined asfollows:
Actually, G(j) indicates the maximum number of shifts by goodsuffix rule when a mismatch occurs for comparing with some character in T.Figure 3.14(b) shows the values of G for P="ATCACATCATCA".Note that G(m) is always 1.
For 1 <=j <=m-1, let the suffix function f'(j) for be thesmallest k, j +2 <=k<=m, such that =; m+1 if there is no such k.This is illustrated in Figure 3.18.Let f'(m)=m+2for .It is easy to see that function f' is similar to the prefix function for reverse of P.Figure 3.19 shows the values of f' for P="ATCACATCATCA",where m=12. We set f'(12)=12+2=14. Since there is no k for 13=j+2 <=k<=m=12,f'(11) is set to 12+1=13.Since =≠ =for j=10 and j +2=12 <=k<=m=12, we get f'(10)=m+1=13.Similarly, f'(9)=12+1=13.For j= 8 and j +2=10 <=k<=m=12, we let f'(8)=12because 12 is the smallest value of k such that=="A"==.For j=7 and j +2=9 <=k<=m=12, we let f'(7)=11because 11 is the smallest value of k such that= ="CA"==.
Figure 3.18: The Suffix Function f’
Figure 3.19: Function f’ and G
Let = for x > 1 and = .Function f' can be redefined as follows:
By comparing Figure 3.5 andFigure 3.18, we can easily see the suffixfunction f' can be computed as f, except the direction isreversed.
Function G can be determined by scanning P twice, where the first one is a right-to-leftscan and the second one is a left-to-right scan.Function f' is generated in the first right-to-left scan and some values of G can bedetermined in this scan.Consider Figure 3.15 again. We can easily observethe following fact:
Fact 1: If we scan from right to left and is determined during the scanning,then >=.
Consider Figure 3.19 again.Suppose that = is considered now. We set f'(4) to 8.This means that =="CATCA"==.In addition, if =≠ =, we know = m+j-f'(j)+1 = 9.By Fact 1, G(j)=m-max{,}= m- = m-(m+j-f'(j)+1) = (f'(j)-1)-j = 8 – 1-4 = 3.
When we compute f', there may be a "while" loop.Function can be determined when we perform this "while"loop. That is, if t=f'(j) – 1 <=m and ≠, then=m – t+j, as illustrated in Figure 3.20.
Figure 3.20: The Computation of
Suppose we have scanned from right to left and determined f'(j)for j=1, 2, …, m. Now, consider f'(1).From Figure 3.19, f'(1)=10.This means that =. Let t be f'(1)-1.Let us now observe ==="A"=.This means =. Thus fromFigure 3.15, =4=m-(f'(1)-1)+1=m-f'(1)+2. Besides, it can be easily seen that = for j=2, 3, …, f'(1)-2.This is illustrated in Figure 3.21.
Figure 3.21: The Computation of
Let k' be the smallest k in {1, …, m} such that = and <=m.If G(j') is not determined in the first scanand 1 <=j'<=, thus, in the second scan, weset G(j')=m-max{,}=m-=.If no such k' exists, set each undetermined value of G to m in the second scan.For example, since ==="A"=, weset G(1), G(2), G(3), G(4), G(5), G(6) and G(8) tof'(1) – 2=8.
In general, let z= and f''(x)=f'(x)-1.Let k'' be the largest value k such that <= m.Then we set G(j')= m- = m - (m-)=, where 1 <=i<=k'' and j'<=and =z.For example, since z=8 and k''=2, we set G(9) and G(11) to=f'(8) – 1=12 – 1=11.
The complete algorithm for computing G using f' is listed below:
Algorithm 3.5An algorithm for computing function G
Input: A string pattern P of length m.
Output: Function G for P.
forj=1 tom-1 do
G(j)=0;
endfor
f'(m)=m+2;
t=m;
forj=m-1 to1 do
f'(j)=t+1;
while (t<=mand≠) do
if (G(t) = 0) then
G(t)=t-j;
endif
t=f'(t)-1;
endwhile
t=t-1;
endfor
forj=1 tom-1 do
if(G(j)=0) then
G(j)=t;
endif
if (j=t) then
t=f'(t)-1;
endif
endfor
We essentially have to decide the maximum number of steps we can move the windowto the right when a mismatch occurs. This is decided by the following function:
max{G(j),j – B()}.
The Boyer-Moore algorithm has two phases, one for computing B and G and the other onefor searching patterns.The complete algorithm is described in the following:
Algorithm 3.6The Boyer-Moore algorithm for exact matching
Input: A text string T of length n and a pattern string P of length m.
Output: All occurrences of P in T.
/* Phase 1 */
Compute G;
Compute B;
/* Phase 2 */
s=1;
while (s <=n – m+1) do
j=m;
while ( = andj>= 1) do
j=j-1;
endwhile
if (j=0) then
Report that pattern P appears at position s;
s=s+G(1);
else
s=s+ max{G(j),j-B()}
endif
endwhile
Consider the text T and pattern P given in Figure 3.22,where n=24 and m=12.The functions G and B are shown inFigure 3.14.The Boyer-Moore algorithm first sets s=1 and starts the pairwise-comparing in the right-to-leftorder (see Figure 3.22(a)).The first mismatch occurs at = and the algorithm recomputess by the following statement:
s = s+max{G(j),m-B()}
= 1+max{G(12),12-B()}
= 1+max{1,2}
= 3
Figure 3.22: An Example for the Boyer-Moore Algorithm
Then we shift the window to position s=3 and starts a newpairwise-comparing (see Figure 3.22(b).Next, the mismatch occurs at = and s is reset by thefollowing statement:
s = s+max{G(j),m-B()}
= 3+max{G(7),12-B()}
= 3+max{3,0}
= 6
So we shift the window to position s=6 and starts a newpairwise-comparing (see Figure 3.22(c).After m comparisons, a P is found at position s=6.Then we recompute s by the following statement:
s = s+G(1)
= 6+8
= 14
Since s=14 > n – m+1=24 – 12+1=13, the algorithm terminates.
Preprocessing time in phase 1 is O(m)+O(m+|Σ|) = O(m+|Σ|),where O(m+|Σ|) time is for computing B and O(m) time for G.However the worst time for phase 2 would be O((n – m+1)m).It was proved that this algorithm has O(m) comparisons when P is not in T.However, this algorithm has O(mn) comparisons when P is inT. Many Boyer-Moore-like algorithms (such as Apostolico-Giancarlo algorithm)were derived and have O(m) time in the worst case.It is more efficient in practice than KMP algorithm.
3.4 Suffix Trees and Suffix Arrays
Let S be a string ofn characters. Let denote the suffix of S for 1 <=i<=n. Forexample, if S is ATCACATCATCA, its 12 suffixes are listed inTable 3.1.
Table 3.1: Suffixes for S = “ATCACATCATCA”
The basic idea of suffix tree concept is that we can classify all of thesubstrings of a given sequence into groups. For example, consider theabove sequence S=ATCACATCATCA. There are only three characters whichappear in this sequence. They are A, C and T. Thus, we can classifyall of the substrings into three groups: (1) the substrings which startwith A, (2) the substrings which start with C and (3) the substringswhich start with T. Note that A, for instance, appears in locations 1,4, 6, 9 and 12. This means any substring which starts with A must be oneof the following suffixes: S(1), S(4), S(6), S(9) and S(12). Similarly,any substring which starts with C must be only of the following suffixes:S(3), S(5), S(8) and S(11). The suffix tree, as we shall introduce inthe follows, groups the suffixes in such a way that whether any substringappears in S can be found easily by using the above observation.
A suffix tree of S of length n is a tree with the following properties:
Each tree edge is labeled by a substring of S.
Each internal node has at least 2 children.