Sunday 2 November 2014

Data Structure and Algorithm - Reversing a Sub-string to Get the Least Lexicographical String

1. Problem Description
This is a Google Interview Question SDE-2s from careerup. The following is the original description from the website.

"You are given a string S. Each character of S is either ‘a’, or ‘b’. You wish to reverse exactly one sub-string of S such that the new string is lexicographically smaller than all the other strings that you can get by reversing exactly one sub-string. 
For example, given ‘abab’, you may choose to reverse the substring ‘ab’ that starts from index 2 (0-based). This gives you the string ‘abba’. But, if you choose the reverse the substring ‘ba’ starting from index 1, you will get ‘aabb’. There is no way of getting a smaller string, hence reversing the substring in the range [1, 2] is optimal. 

Input: 
First line contains a number T, the number of test cases. 
Each test case contains a single string S. The characters of the string will be from the set { a, b }. 

Output: 
For each test case, print two numbers separated by comma; for example “x,y” (without the quotes and without any additional whitespace). “x,y” describe the starting index (0-based) and ending index respectively of the substring that must be reversed in order to acheive the smallest lexicographical string. If there are multiple possible answers, print the one with the smallest ‘x’. If there are still multiple answers possible, print the one with the smallest ‘y’. 

Constraints: 
1 ? T ? 100 
1 ? length of S ? 1000 

Sample Input: 

abab 
abba 
bbaa 
aaaa 
babaabba 

Sample Output: 
1,2 
1,3 
0,3 
0,0 
0,4"

2. Solution
This solution has algorithm complexity of O(N) and space complexity of O(1). The fundamental idea is that the result string will have as many 'a's as possible in the left hand after reversing a sub-string, because the lexicographical comparison starts from the left to the right and stops when the first character is not equal.

The start of the sub-string
This is reasonably easy question. The answer is that the sub-string starts from the first 'b' in the string from the left hand direction, if there is 'a' existing afterwards. Otherwise starts from 0.
If there is no 'b' existing in the string, then the string consists of only 'a's. Therefore the string is already at the least lexicographical state and the sub-string {0, 0} should be return according to the problem requirement - return the sub-string with the least indices if multiple choices exist. For instance 'aaaa', there is no 'b' and reversing any sub-strings will have the same result. According to the requirement the first character {0, 0} is the correct answer.
If there is 'b' exiting, then the very first 'b' from the left hand side is the starting point of the sub-string. For instance ''aaabbaabbbabbab', then the sub-string should start from 3 (the index of string starts from 0).

The end of the sub-string
It must end with 'a' that appears after the start of the sub-string, if such 'a' exists. Otherwise ends at 0. Let's take a look at a few scenarios.

Scenario 1: there is no such 'a' after the first 'b'
If there is no 'a' existing after the very first 'b', then this string is already at its least lexicographical state, Therefore the program should return {0, 0}. For instance 'aabbbb', the very first 'b' is at position 2 and afterwards there is no 'a' exits. And this string is already at the least lexicographical state, and no reversing can possibly make it lexicographically less.

Scenario 2: there is only one continuous 'a's appearing after the first 'b'
The sub-string will be starts from the very first 'b' and ends at the last 'a'. For instance 'aabbaaabbb', the very first 'b' is at position 2 and the last 'a' after the first 'b' is at position 6, 'bbaaa'. Reversing this sub-string {2, 6} will have 'aaaaabbbbb', which is the sort of string as Scenario 1 that is already at the least lexicographical state. So the correct answer for this scenario is {firstB, lastAAfterFirstB}.

Scenario 3: multiple continuous 'a's after the first 'b' but only one instance with most 'a's
There can be any number of continuous 'a's appearing after the first 'b'. These continuous 'a's consist of any number of 'a's but only one continuous 'a's have the most 'a's. For instance,
    'aabbaaabbbaaaabbbbaaaaaabb'
    - The first 'b': at position 2
    - 3 continuous 'a's after the first 'b': 'aaa', 'aaaa' and 'aaaaaa'
    - The continuous 'a' having the most 'a's are 'aaaaaa'. With 6 'a's
Now we have 3 candidates of sub-strings
    'bbaaa'                            
    'bbaaabbbaaaa'
    'bbaaabbbaaaabbbbaaaaaa'
And the last candidate is what we want, because after reversing it will have longest 'a's on the left hand side of continous 8 'a's. Other two candidates will have  5 and 6. Therefore the correct answer for this scenario is {firstB, lastAOfLongestContinuousA_AfterFirstB}. In the above example {2, 23}.

Scenario 4: multiple continuous 'a's after the first 'b' and multiple instances with most 'a's
All the above scenarios can be easily achieved with O(N) algorithm complexity and O(1) space complexity. For this scenario that I am most struggled with is to find a solution with O(N) algorithm complexity.
Initially I was thinking of recording all the instances with most 'a's, then go thought all the candidates and find the one that reversing can result the least lexicographical string (or find the one with the least indices if multiple candidates have the same result). But in this way for some extreme cases it won't guarantee O(N) algorithm complexity, for instance
    'bababababababababa'
    - the very fist 'b' in the left at position 0
    - After the very fist 'b', all the continuous 'a's have size of 1.
    - There are N/2 candidates (N is the length of the string), record them
    - Compare the N/2 candidates and find out the correct one
    - It is a O(N^2) solution

A solution with O(N^2) algorithm complexity is not what I was targeting obviously. After carefully examining this cases, a linear solution found. The key idea behind this solution is to how to pick the correct candidate (the candidate has the same length and most 'a's). Here is the pseudo-code
    1. Go through each character in the reverse direction of the string until firstB (include firstB)
    2. Record the startIndex, the numberOfA and LenToNextContinousA of the longest continuous 'a's
    3. Keep update the startIndex, numberOfA and LenToNextContinousA, if found a longer one
    4. If found 2nd continuous 'a' with the same number 'a'. Compare the following LenToNextContinousA - numberOfA character with str[startIndex - LenToNextContinousA + 1, startIndex - numberOfA]. If less, then update startIndex, LenToNextContinousA to the 2nd continuous 'a'.
    5. Until hit firstB.
Step 4 is to decide to pick up the candidate. Here is the explanation. Considering two adjacent two continuous 'a's with longest 'a's found so far.
    "a...absssssbaaaabxxxxxxbaaaabzzzzz", suppose
    reverse(xxxxxxxx) = yyyyyyyy
    reverse(sssss) = ooooo, then we have two candicates
    Result 1: a...aaaaabooooobbxxxxxxbaaaab....  <= Candidate 1
    Result 2: a...aaaaabyyyyyybaaaabsssssbb....    <= Candidate 2
Compare Result 1 and Result 2, then the two lexicographical equality is decided by the string marked as red color booooobb and  byyyyyyb. Because in the purple color part Result 2 is always less than Result 1, because xxxx can not be aaaa. Therefore we can conclude,
    - If  booooobb < byyyyyyb, then Result 1 < Result 2 and take Candidate 1
    - If  booooobb > byyyyyyb, then Result 2 < Result 1 and take Candidate 2
    - If  booooobb = byyyyyyb, then Result 2 < Result 1 and take Candidate 2

So far now we have gone through all the scenarios and the sub-string can be found in a single linear search on the string. It is a linear solution and requires constant space.

3. C++ Implementation
// Implementation
// ********************************************************************************
#include <string>

#include <cassert>

struct AabbResult {
    AabbResult(int s, int e)
        : start(s), end(e)
    {}

    int start;
    int end;
};

AabbResult AabbSolution(const std::string& str)
{
    if (str.empty()) {
        return AabbResult{ -1, -1 };
    }

    size_t firstB = str.find_first_of('b');
    if (firstB == std::string::npos) {
        // all are 'a', so need to reverse
        return AabbResult{ 0, 0 };
    }

    bool foundA = false;
    size_t rFirstA = 0;
    size_t rNextA = 0;
    size_t MaxNumOfContinousA = 0;
    for (size_t index = str.length() - 1; index > firstB; --index)
    {
        if (!foundA && rFirstA == 0 && str[index] == 'a') {
            foundA = true;
            rFirstA = index;
            continue;
        }

        if (foundA && str[index] != 'a') {
            // find the first continuous 'a' in the reverse order 
            // and record its length
            foundA = false;
            MaxNumOfContinousA = rFirstA - index;
        }

        if (!foundA && rFirstA > 0 && str[index] == 'a') {
            // find the start of the 2nd continuous 'a' in the
            // reverse order
            rNextA = index;
            break;
        }
    }

    if (rFirstA == 0) {
        // not found any 'a' after first 'b'
        // eg: aaabbbb
        return AabbResult{ 0, 0 };
    }

    if (rNextA == 0) {
        // only one continuous 'a' between firstB and rFirstA
        // aaabbbaaaabb
        return AabbResult( static_cast<int>(firstB),
                           static_cast<int>(rFirstA) );
    }

    // There are multiplie multipile continuous 'a' after firstB
    // eg: aaabbbaaabbaaabbaabb
    size_t strLenBetweenLongestA = 0;
    size_t rLongestAIndex = rFirstA;
    size_t numOfContinuousA = 0;
    bool   sameMaxLengthContinousAFound = false;
    size_t numToContinueComparing = rFirstA - rNextA;
    foundA = false;
    size_t tempIndex = 0;
    for (size_t index = rNextA + 1; index > firstB; --index, ++tempIndex) {
        if (sameMaxLengthContinousAFound) {
            // Step 4 on Scenario 4
            if (numToContinueComparing > 0) {
                if (str[index - 1] > str[rLongestAIndex - MaxNumOfContinousA - tempIndex]) {
                    // keep the continuous 'a' as it is still the least lexicographical one
                    // Case of booooobb > byyyyyyb
                    sameMaxLengthContinousAFound = false;
                }
                else if (str[index - 1] < str[rLongestAIndex - MaxNumOfContinousA - tempIndex]) {
                    // Case of booooobb < byyyyyyb         
                    // The newly found continuous 'a's is the least lexicographical one. 
                    // Update the record
                    rLongestAIndex = rFirstA;
                }

                --numToContinueComparing;
            }
            else {
                // keep the continuous 'a' as it is still the least lexicographical one
                // Case of booooobb = byyyyyyb
                sameMaxLengthContinousAFound = false;
            }
        }

        if (!foundA && str[index - 1] == 'a') {
            strLenBetweenLongestA = rLongestAIndex - (index - 1);
            rNextA = index - 1;
            numOfContinuousA = 1;
            foundA = true;
            continue;
        }

        if (foundA) {
            if (str[index - 1] == 'a') {
                ++numOfContinuousA;
                if (numOfContinuousA > MaxNumOfContinousA) {
                    // found a even longer continuous 'a's
                    // clear the flag to cancel the comparison of Step 4 on Scenario 4
                    sameMaxLengthContinousAFound = false;
                }
            }
            else {
                if (numOfContinuousA > MaxNumOfContinousA) {
                    // found a even longer continuous 'a's
                    // update the record
                    rLongestAIndex = rNextA;
                    MaxNumOfContinousA = numOfContinuousA;
                    sameMaxLengthContinousAFound = false;
                }
                else if (numOfContinuousA == MaxNumOfContinousA) {
                    // found two continuous 'a' with the same number of 'a'
                    // Set the flag to start the comparison of Step 4 on Scenario 4
                    sameMaxLengthContinousAFound = true;
                    numToContinueComparing = strLenBetweenLongestA - MaxNumOfContinousA;
                    tempIndex = 0;
                    rFirstA = rNextA;
                }

                foundA = false;
            }
        }

    }

    if (sameMaxLengthContinousAFound) {
        assert(numToContinueComparing > 0);
        // found two continuous 'a' with the same number of 'a'
        // not finished Step 4 on Scenario 4 yet before hitting firstB
        // conitnue to the reversed part
        for (size_t index = 0; index < numToContinueComparing; ++index, ++tempIndex) {
            if (str[rFirstA + index + 1] >
                              str[rLongestAIndex - MaxNumOfContinousA - tempIndex]) {
                // Case of booooobb > byyyyyyb, and
                // Case of booooobb = byyyyyyb
                return AabbResult{ static_cast<int>(firstB),
                                   static_cast<int>(rLongestAIndex) };
            }
            else if (str[rFirstA + index + 1] <
                             str[rLongestAIndex - MaxNumOfContinousA - tempIndex]) {
                // Case of booooobb < byyyyyyb
                return AabbResult{ static_cast<int>(firstB),
                                   static_cast<int>(rFirstA) };
            }
        }
    }

    return AabbResult{ static_cast<int>(firstB),
                       static_cast<int>(rLongestAIndex) };
}

// ********************************************************************************
// Test
// ********************************************************************************
void TestCaseFromCareerup()
{
    AabbResult result = AabbSolution("");
    assert(result.start == -1 && result.end == -1);
     
    result = AabbSolution("abab");
    assert(result.start == 1 && result.end == 2);

    result = AabbSolution("abba");
    assert(result.start == 1 && result.end == 3);
 
    result = AabbSolution("bbaa");
    assert(result.start == 0 && result.end == 3);
 
    result = AabbSolution("aaaa");
    assert(result.start == 0 && result.end == 0);
 
    result = AabbSolution("bbbbbbbbbbbbbbbb");
    assert(result.start == 0 && result.end == 0);

    result = AabbSolution("aaaaaaaaaaaaaa");
    assert(result.start == 0 && result.end == 0);
}

void TestCasesSingleHighestContinousA()
{
    AabbResult result = AabbSolution("babaabba");
    assert(result.start == 0 && result.end == 4);

    result = AabbSolution("babaabbaaaaabbbbaaaaaaabbbaabbaabababab");
    assert(result.start == 0 && result.end == 22);

    result = AabbSolution("aababaabbaaaaabbbbaaaaaaabbbaabbaabababab");
    assert(result.start == 2 && result.end == 24);
}

void TestCasesMultipleHighestContinuousAs()
{
    AabbResult result = AabbSolution("abbaaabbbbbbaaa"); // abbbaaabbbbbbaaa
    assert(result.start == 1 && result.end == 14);
    result = AabbSolution("abbbaaabbbbbbaaa");
    assert(result.start == 1 && result.end == 15);
    result = AabbSolution("abbaaababbbbaaa");
    assert(result.start == 1 && result.end == 5);
    result = AabbSolution("baabaabbaa");
    assert(result.start == 0 && result.end == 5);

    // a babaaa babaaa babaaa babaaa babaaa
    result = AabbSolution("ababaaababaaa");
    assert(result.start == 1 && result.end == 12);
    result = AabbSolution("ababaaababaaababaaa");
    assert(result.start == 1 && result.end == 18);
    result = AabbSolution("ababaaababaaababaaababaaa");
    assert(result.start == 1 && result.end == 24);
    result = AabbSolution("ababaaababaaababaaababaaababaaa");
    assert(result.start == 1 && result.end == 30);
}

// ********************************************************************************

No comments:

Post a Comment