Tuesday 25 November 2014

Data Structure and Algorithm - Detect the Repetitive Sub-string

1. Problem Description
Yet again this is a Google Interview Question for Software Engineer/Developer on careerup. Here is the description on that thread.

"Given a string (1-d array) , find if there is any sub-sequence that repeats itself. 
Here, sub-sequence can be a non-contiguous pattern, with the same relative order. 

Eg: 

1. abab <------yes, ab is repeated 
2. abba <---- No, a and b follow different order 
3. acbdaghfb <-------- yes there is a followed by b at two places 
4. abcdacb <----- yes a followed by b twice 

The above should be applicable to ANY TWO (or every two) characters in the string and optimum over time. 

In the sense, it should be checked for every pair of characters in the string."

2. Solution
As we are looking for repeating sub-string, therefore any character that appears only once will not count. We only care about these characters that appears  multiple times.

A string with characters appearing more than twice
Given a string as "...asdfasxabcbdkldjfadxkldjkadxadsfads...", character 'x' appears 3 or more times, then I believe that this is a repetitive string, because the second 'x' follows up the first one and the third one follows up the second. So it is repetitive.
Therefore a string has any character appearing more than twice, then it is a repetitive string.

A string with all characters appearing no greater twice
As we discussed above, those characters that appear only once are out of interest and they can be removed as noise. Assume that we remove the noise and only characters that appear twice remain. Let's examine what they should look like. Let's take an arbitrary example, a, b, c and d appear twice. If we have starting string as "abcd", what could be the next string in order to make this string non-repetitive. Obviously here are 4 options.
    - If it is 'a', then definitely this is a repetitive string because b, c, and d will appear after second 'a'.
    - If 'b', then definitely this is a repetitive string because c and d will appear after second 'b'.
    - If 'c', then definitely this is a repetitive string because d will appear after second 'c'.
    - If 'd', undecided and dependent on what the next follows up.

This is actually telling us a very interesting story. If a string, whose characters all appear twice, has to be a palindrome, then it will be qualified as non-repetitive string. Otherwise it will definitely be a repetitive string. Here summarize the pseudo-code:
    - Detect each character's appearance. Return true, if any appears more than twice
    - Strip out all characters that appear only once.
    - Check if the rest string is palindrome. If yes, return false otherwise return true.

Complexity analysis
A linear solution is implemented. Use a hash map to record/detect the appearance of characters, which can be used to strip out all the characters that appear only once.Then checking if the remaining string is palindrome can be done in linear solution as well by doing comparison from both ends. So overall this is a linear solution on both algorithm complexity and space complexity.

3. C++ Implementation
// ********************************************************************************
// IMPLEMENTATION
// ********************************************************************************
#include <string>
#include <unordered_map>
#include <vector>

#include <cassert>

typedef std::unordered_map<size_t, size_t> FrequencyMap;

bool IsPalindrome(const std::string& myStr)
{
    if (myStr.empty()) {
        return false;
    }

    const size_t len = myStr.size();
    if (len == 1) {
        return true;
    }

    for (size_t start = 0, end = len - 1; start < end; ++start, --end) {
        if (myStr[start] != myStr[end]) {
            return false;
        }
    }

    return true;
}

bool DetectFrequencyAndReturnFalseIGreaterThan2(const std::string& myStr, FrequencyMap& freqMap)
{
    for (std::string::const_iterator cIt = myStr.begin(); cIt != myStr.end(); ++cIt) {
        FrequencyMap::iterator found = freqMap.find(*cIt);
        if (found == freqMap.end()) {
            freqMap.insert(std::make_pair(*cIt, 1));
        }
        else {
            ++found->second;
            if (found->second > 2) {
                return true;
            }
        }
    }

    return false;
}

std::string StripCharsAppearingOnlyOnce(const std::string& myStr, const FrequencyMap& freqMap)
{
    std::string result;
    for (std::string::const_iterator cIt = myStr.begin(); cIt != myStr.end(); ++cIt) {
        FrequencyMap::const_iterator found = freqMap.find(*cIt);
        assert(found != freqMap.end());
        if (found->second == 2) {
            result.push_back(*cIt);
        }
    }

    return result;
}

bool FindRepeatedSequence(const std::string& myStr)
{
    FrequencyMap freqMap;
    if (DetectFrequencyAndReturnFalseIGreaterThan2(myStr, freqMap)) {
        return true;
    }

    std::string stringWithCharAppearingTwice = StripCharsAppearingOnlyOnce(myStr, freqMap);
    if (stringWithCharAppearingTwice.empty() || stringWithCharAppearingTwice.size() < 4) {
        return false;
    }

    return !IsPalindrome(stringWithCharAppearingTwice);
}

// ********************************************************************************
// TEST
// ********************************************************************************
void TestCases()
{
    assert(FindRepeatedSequence("aaa") == true);
    assert(FindRepeatedSequence("abab") == true);
    assert(FindRepeatedSequence("abba") == false);
    assert(FindRepeatedSequence("acbdaghfb") == true);
    assert(FindRepeatedSequence("abcdacb") == true);
}
// ********************************************************************************

No comments:

Post a Comment