Wednesday 6 August 2014

Data Structure and Algorithm - String Pattern Search

It was a very pleasant train journey after work with a couple friends and talking about the interesting algorithms we have encountered. One of the friends talked about some interesting algorithm for string pattern search. And he could not recall exactly what its name is but it can speed up the string pattern search and do come clever step jumping other than the naive method that simply uses the brutal forces of computers. He mentioned the name of algorithm starting with "D"-something and this algorithm has linear complexity of computation. I thought he means "Dijkstra", which is one of the biggest names in computer science.

To the curiosity towards this algorithm I googled well-known string pattern search algorithm. I think what he means is the Knuth-Morris-Pratt algorithm. "Knuth", another big name of computer science. Still remember the first time I read his 3 magnificent books "The Art of Computer Programming". Even now still I have not the courage to finish some chapters in the third volume.

Again this is my another coding exercise on the train for 20 minutes of test code and 30 minutes of code implementation. See here is the list of algorithms stated on wikipedia about string pattern search.

1. Naive method
This is of course the first method that most of developers can think of. Let's assume
    - Text string, S with length m
    - Pattern sting, P with length n
As simple as it is, from the beginning to the end [0, m-1], compare S[i] to P[j], where
    - i within [0, m-1],
    - j within [0, n-1]
If at the first position i, where has S[i+j] = P[j] for every j, where j within [0, n-1]. Then i is the what we want and the match does exist. Otherwise match does not exist and there does not exist such a i.

// String pattern search - Naive method
//********************************************************************************
ptrdiff_t StringPatternSearch_Naive(const std::string& text,
const std::string& pattern)
{
if (text.empty() || pattern.empty()) {
return -1;
}

for (size_t pos_t = 0, pos_p = 0; (pos_t + pos_p) < text.size();) {
if (text[pos_t + pos_p] == pattern[pos_p]) {
++pos_p;
if (pos_p == pattern.size()) {
return static_cast<ptrdiff_t>(pos_t);
}
}
else {
++pos_t;
pos_p = 0;
}
}

return -1;
}
//********************************************************************************

Naive method has the worst case of O(m*n) and average case of O(m+n).

2. Rabin-Karp Algorithm
The idea of Rabin-Karp algorithm is to use hash function to computer the value of every continuous sub-strings with length of n in S and compare the hash value against the hash value of the pattern P. If they are equal, then compare the sub-string in S and pattern P to verify that they are equal.

//  pseudo code - Rabin Karp Algorithm
//********************************************************************************
patternHashValue <-- compute hash value of pattern P
for every pos within [0, m-n) in text S
     substringHashValue <-- compute hash value of sub-string of text S in [pos, pos+n]
     compare patternHashValue against substringHashValue, if true
             compare this sub-string against pattern P, if equal
                      match found and return pos
     pos <-- pos +1

not match found and return -1
//********************************************************************************

The optimization of this algorithm happens in hash function. For each computation next to each other only one character needs to computer. Moving from pos to pos+1, S[pos] is replaced by S[pos+n] and all the computation in hash function on S[pos+1],..., S[pos+n-1] has been calculated previously. With a good design/implementation of hash function, Rabin-Karp Algorithm can achieve O(n+m) on average.

3. Knuth-Morris-Pratt Algorithm
The idea of Knuth-Morris-Pratt algorithm is to extract information from pattern P and use this information as the heuristic rules to guide the search. Comparing with naive method this algorithm is trying to increment pos_t and pos_p with variously steps to speed up the performance.

A simple example of this heuristic rule is that: assuming pattern P is consisted of all different characters (no single character appears twice). For say at position i at text S and j at pattern P, S[i+j] is equal to P[j], for every j within [0, k], where k < n-1. But S[i+j+1] is not equal to p[j+1]. pos_t is incremented by 1 and pos_p is assigned to 0, according to naive method. But here we know actually pos_t can be increment by j+1, because every character in pattern P is different and there is no match that S[l], where l within [i, i+j+l] is equal to P[0]. This is telling us that there must be some information in pattern P that can be used as heuristic rules to guide the search.

This heuristic rules are based on the repeat of  sub-string in pattern P. The upcoming string is always trying to match from the start of P[0]. More detail please refer to Knuth-Morris-Pratt algorithm on wikipedia. And this algorithm can achieve linear complexity, O(m).

// String pattern search - Knuth-Morris-Pratt
//********************************************************************************
void BuildUpKMPTable(const std::string& pattern, std::vector<ptrdiff_t>& table)
{
if (pattern.empty()) {
return;
}

table.clear();
table.reserve(pattern.size());
table.push_back(-1);

ptrdiff_t matchCount = 0;
for (size_t pos = 1; pos < pattern.size(); ++pos) {
table.push_back(matchCount);
if (pattern[pos] == pattern[matchCount]) {
++matchCount;
}
else {
matchCount = 0;
}
}
}

ptrdiff_t StringPatternSearch_KMP(const std::string& text,
 const std::string& pattern)
{
    if (text.empty() || pattern.empty()) {
return -1;
}

std::vector<ptrdiff_t> kmpTable;
BuildUpKMPTable(pattern, kmpTable);

for (size_t pos_t = 0, pos_p = 0; (pos_t + pos_p) < text.size();) {
if (text[pos_t + pos_p] == pattern[pos_p]) {
++pos_p;
if (pos_p == pattern.size()) {
return static_cast<ptrdiff_t>(pos_t);
}
}
else {
if (kmpTable[pos_p] < 0) {
++pos_t;
pos_p = 0;
}
else {
pos_t = pos_t + pos_p - kmpTable[pos_p];
pos_p = kmpTable[pos_p];
}
}
}

return -1;
}
//********************************************************************************

4. Summary
Finite-state automation search algorithm can also achieve linear complexity but is more expensive on pre-processing than Knuth-Morris-Pratt algorithm. Boyer-Moor search algorithm also process the pattern string to get heuristic rules to guide search but it's worst case is O(m*n). More details about the comparison on these algorithms, please refer to String Search Algorithm on wikipedia.


Bibliography:
[1] http://en.wikipedia.org/wiki/String_searching_algorithm
[2] http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
[3] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
[4] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm

No comments:

Post a Comment