Friday 19 June 2015

Data Structure and Algorithm - Find the Maximal Non-Overlapping Subset (Microsoft Interview Question)

1. Problem Description
This is a Microsoft interview question for interns from careercup. Here is the original thread,

Find a maximum subset of S such that no pair of intervals in S' overlaps ?"

2. Analysis
Given a set S = {(starti, endi)} where i is from 1 to n, the task is to find the largest subset whose elements are not overlapped with each other. For instance {(1, 3), (2, 5), (4, 6), (7, 8)}, the largest subset is {(1, 3), (4, 6), (7, 8)}. As the original problem description is not very clear about the two ends. In this case in this blog entry assume that the interval has open two ends - exclusive for lower bound and upper bound. 

what does the result, S', look like?
All the elements in S' should not overlapp. Assuming that the size of S' as m, we sort the lower bounds and upper bounds as
    - {LB1, LB2, LB3, .... LBm} and
    - {UB1, UB2, UB3 .... UBm}
The following property should hold:
    - LB1 <  LB2 < LB3 ... < LBm
    - UB1 <  UB2 < UB3 ... < UBm
    - LB1 < UB1 <= LB2 < UB2 <= LB3 ... <= LBm < UBm
Here UBi <= LBi+1, it allows "=" because two ends are excluded. For instance (1, 3) and (3, 4) are regarded as not overlapped.

At the same time S' has to be one of subsets that have the largest elements. For instance {(1, 4), (3, 5), (7, 8)},  there are two solutions: {(1, 4), (7, 8)} or {(3, 5), (7, 8)}. S' has to be one of these solutions.
How to find (one of) the largest solution(s), we will discus later in this section.

How to detect the overlap between intervals?

The basic idea is to sort the lower bounds and upper bounds, find out how many the interval's lower bound and upper bound sweep through, and calculate the difference - which represents how many this interval overlaps with other intervals.

When find out the number of overlap of each interval, the sum of the number of overlap is always an even number. Assume the first interval overlaps with other intervals as OV1, represent the rest together as {OV1, OV2, ..., OVn}, 
    - sigma(OVi) is an even number, where i is from 1 to n,
Each overlap is counted once by two intervals.It is symmetric. For instance {(1, 4), (2, 3), these two intervals are overlapped. Each interval has an overlap number of 1 and sum them together as 2.

Now let's take another look at S' - the properties it holds. All its elements do not overlap with each other. Therefore the following statement holds
    - sigma(OVi) = 0, where i is from 1 to m and OVi >= 0.

How to guarantee S' the largest?
Let's take a look where we start and where we finish. Start with S with sigma(OVi) >= 0 (even number), where i's range is [1, n] and end with S' with sigma(OVj) = 0, where j's range is [1, m]. The objective function is quite simple max(m), It is equivalent to find the largest subset S' whose elements do not overlap.

Redefine the object function as 
    - min(n - m), where n is a const number given to the size of S
Rather to find the "m", the idea is to remove the least elements/intervals in S to reach S' and use steepest descent algorithm as guider to remove intervals from S.

How does it work with steepest descent?  
Now we know how to use sweep line algorithm to find the number of overlap for each intervals as OV1, OV2, ..., OVn for S and sigma(OVi) as SGn. Start with the interval with max(OVi),  as MAXOVn and remove it from S. Then the left set as Sn-1 have total overlap as SGn - 2*MAXOVn. Then find the number of overlap for Sn-1 and remove the intervals with maximal the number of overlap. Repeat this process until the sum of overlaps reach 0 (reach S' with sigma(OVj) = 0).
    - The cost function of steepest descent: the number of overlap of interval
    - The step of steepest descent: 1 - each time remove one interval

3. Solution
pseudo-code
1. Sort the lower bounds and upper bounds as LB and UB
2. Use sweep line algorithm to detect the number of overlap for each interval
    2.1 Track the interval with largest number of overlap during this process
    2.2 Track the sum of overlap
    2.3 For the interval with 0 overlap
        2.3.1 Remove interval's lower/upper bound from LB and UB
        2.3.2 Move the interval to S'
        2.3.3 Remove it from S 
3. Check sum of overlap
    3.1 Go to Step 4 if it is more than 0
    3.2 Go to Step 6 if it is equal to 0
4. steepest descent: remove the interval with the largest number of overlap from S
5. Remove its lower/upper bound from LB and UB and go to Step 2
6. S' is the result and terminate the program

Solution complexity
Sorting lower bound and upper bound
    - O(NlogN)
Sweep line algorithm to find the number of interval
    - O(NlogN)
    - Should be less than this because interval with 0 overlap removed from LB and UB
Repeat sweep line algorithm for N-M times as
    - N*logN + (N-1)*log(N-1) + ... + (N-M)*log(N-M)
    - Should be less than this because interval with 0 overlap removed from LB and UB
Space complexity
    - O(N)

Example

Example: find the maximal non-overlapping subset

Monday 8 June 2015

C++11 Features - Mutex and Deadlock

This is more or less a reading note of Chapter 3 of "Concurrency IN ACTION - Practical Multithreading" by Anthony Williams. As I have discussed threading - deadlock and how to ensure one instance in singleton in my previous blog entries, I cherry-pick up a few things that would be beneficial for these two issues in C++11 features.

1. std::lock_guard and std::mutex
This is one of C++ RAII techniques to solve one serious issue. I have discussed a few concrete scenarios that use RAII to prevent serious issues like memory leak, left open file handler and etc in this long entry, RAII.

We are facing the same dangers when designing multithreading code, for instance not unlocking the mutex in some corner code path or not unlocking the mutex when exception thrown. Unlocking the mutex will surely cause deadlock when try to acquire mutex in the next call. std::lock_guard is the solution to prevent these issues. It uses RAII techniques and the mutex associated with it will be automatically unlocked when std::lock_guard object goes out of scope.

C++11 also provides a few variants that are combined with other C++11 features. std::unique_lock to combine with the std::move feature and std::shared_lock to combine the resource management and shared owership.

2. std::shared_mutex and std::shared_lock
This pair are perfect for reader-writer mutex. If anyone familiar with QT mutex or boost mutex, understanding them is not a big deal. It is perfectly designed for one writer thread and multiple reader threads. In order to remove the bottleneck of reading threads, it should just give green lights to reader threads if shared resource is not acquired by the writer thread. Otherwise wait until the writer thread finishes its business.

3. std::lock
This is the biggest cherry I found in this chapter. I have discussed how to avoid deadlock of multiple level/nested locking access in Threading - Deadlock. The techniques are to avoid the circle locking in the design stage and to always lock multiple mutexes in the certain order.

std::lock provides a solution to lock multiple mutexes all-in-once. It takes N locks as parameters and always lock them at once. It is a nothing-or-all operation. Either all N locks are locked at once or none of them is locked if attempt fail.

4. std::call_once
I have discussed the difficulty that we are facing to make sure the uniqueness of singleton in Design pattern - Singleton pattern. Now C++11 provides the exact tool to achieve this. It guarantees that the function passed into it will be only invoked once in multithreading applications.

It is particular important for lazy initialization of C++. Multiple threads race to initialize the data shared between them when they first time come into use. std::call_once will make the lazy initialization in multithreading much easier to be done.

5. Other thing worth mentioning
Still no solution to the deadlock caused by the mismanagement of threading. For instance two worker threading waiting for the counterpart to join. An excellent example (linked list operation) is provided to explain that design thread-safe code is not about the function itself but also its API design.

Thursday 4 June 2015

Data Structure and Algorithm - Find the Overlapping Tasks of Job Schedules (Amazon Interview Question)

1. Problem Description
This is an Amazon interview question for software develop engineer from careercup. Here is the original thread,

"
Given set of job schedules with start and end time, write a function that returns indexes of overlapping sets. 

for ex :-
input -> [1,2][5,6][1,5][7,8][1,6]
return -> [0,1,2,4]".
2. Analysis
My solution to this problem is to find out all the overlapping intervals and then find each overlapping intervals its crossing tasks. In the above example the overlapping intervals are: [1, 6] and [7, 8]. Then find out which tasks they come cross. [1, 6] crosses the first, second, third and fifth tasks. and [7, 8] crosses the 4th task.

To find out how many overlapping intervals is very close to this Google interview question. The only difference is that this problem does not have CONTINUOUS case as the Google interview question. For example [1, 5] and [6, 9] in this case they are not overlapped but they are in Google interview question.

As I introduced the sweep line algorithm in this blog entry, Data Structure and Algorithm - Sweep Line Algorithm and I also use it solve this Google interview question. Here it works too to find out the overlapping intervals.

Problem:
Given N intervals, [X1, Y1], [X2, Y2], ..., [Xn, Yn], where
    - Xi and Yi are all integers with i's range from 1 to N, and
    - the intervals are all inclusive on both ends.

Properties of the Sorted Lower and Upper Bounds:
According to the blog entry, Data Structure and Algorithm - Sweep Line Algorithm, we group the lower bounds and the upper bounds, {X1, X2, ..., Xn} and {Y1, Y2, ..., Yn}. Now we sort both lower bounds and upper bounds in the ascending order as,
    - Lower bounds: {L1, L2, ..., Ln}
    - Upper bounds: {U1, U2, ..., Un}
Because the lower and upper bounds are sorted, the following properties hold,
    - L1 <= L2 <= ... <= Ln
    - U1 <= U2 <= ... <= Un
    - Li <= Ui, for any given i between 1 to N
    - L1 <= L2 <= ... <= Li <= Ui, for any given i betwen 1 to N

Detect the Overlap
Given the sorted lower and upper bounds,
    - Lower bounds: { L1, l2, ...,   Li, Li+1, ...., Ln}
    - Upper bounds: { U1, U2, ..., Ui, Ui+1, ..., Un]

Let's consider if Ui and Li+1 are overlapped, Based on the properties of the sorted lower and upper bounds, Ui is the largest value of {L1, U1, L2, U2, ..., Li, Ui} and Li+1 is the least value of { Li+1, Ui+1, Li+2, and Ui+2, ..., Ln, Un}. Therefore let's examine their relationship,
    - If Li+1 > Ui, then there is NO OVERLAP here,
        * An interval with upper bound of Ui should stop here.
        * A new interval with the lower bound of Li+1 should start here
    - Otherwise there is a OVERLAP

How to find out what tasks the overlapped intervals cross?
Use a hash map to contain the overlapped intervals. Use lower bound of the overlapped intervals as the key and a vector as the value to save the tasks that it crossed.
    - Key: int (lower bounds of overlapped intervals)
    - Value: std::vector<size_t> (to save the indices of tasks that this overlapped interval crosses)

Sort the lower bounds of all overlapped intervals as LB. Then go through the tasks. Search the lower bound of each task lb in LB and find the first value that, ul', is less or equal t ul. Then save the index of this task into the hash map with key value of ul'.

3. Solution
Example:
Let's take a look at the example: [1, 2], [5, 6], [1, 5], [7, 8], [1, 6]
The sorted lower and upper bounds are,
    - Lower bounds: {1, 1, 1, 5, 7}
    - Upper bounds: {2, 5, 6, 6, 8}
Here is the algorithm runs
    - Take 1 as the lower bound of the first interval
    - Ui = 2, Li+1 = 1, where  i = 1. it is a OVERLAP case and continue
    - Ui = 5, Li+1 = 1, where i = 2, it is a OVERLAP case and continue
    - Ui = 6, Li+1 = 5, where i = 3, it is a OVERLAP case and continue
    - Ui = 6, Li+1 = 7, where i = 4, it is NOT a OVERLAP case
        * The first interval stops her as [1, 6]
        * The second interval starts here with lower bound as 7
    - Reach the end, construct the second interval with the upper bound of Un, [7, 8]
Two overlapped intervals generated as [1, 6] and [7, 8]
Insert key 1 with empty vector and key 7 with empty vector into hash map,
Order the overlapped interval's lower bounds as {1, 7}. Go through the lower bound of each interval and find the first value (in this case {1, 7}) which it is equal to or greater than.
    - Task [1, 2]: the 1st value 1 >= 1 is 1. then append 0 into the value of Key 1 of hash map
    - Task [5, 6]: the 1st value 5 >= 1 is 1. then append 1 into the value of Key 1 of hash map
    - Task [1, 5]: the 1st value 1 >= 1 is 1. then append 2 into the value of Key 1 of hash map
    - Task [7, 8]: the 1st value  7 >= 7 is 7. then append 3 into the value of Key 7 of hash map
    - Task [1, 6]: the 1st value 1>= 1 is 1. then append 4 into the value of Key 1 of hash map
Go through the hash map,
    [0, 1, 2, 4] in one group
    [3] in one group

Pseudo-code:
    - 1. Group lower bounds and upper bounds as {X1, ..., Xn} and {Y1, ..., Yn}
    - 2. Sort lower and upper bounds in ascending order as {L1, ..., Ln} and {U1, ..., Un}
    - 3. Take LowerBound = L1 as the lower bound of the first interval
    - 4. Compare Li+1 and Ui where i ranges from 1 to N-1
        * 4.1 if Li+1 > Ui, then 
                 construct the interval with Ui as the upper bound, [LowerBound, Ui]
                 update LowerBound = Li+1
                 go to Step 4.3
        * 4.2 if i = N - 1, then
                 construct the interval with Un as the upper bound, [LowerBound, Un]
                 go to Step 5
        * 4.3 Increment i by 1 and repeat Step 4
    - 5. Sort the lower bounds of overlapped interval as LB
    - 6. construct hash map <int, std::vector<size_t>> with LB and empty vectors
    - 7. Go through the tasks list
        * Find the first value x which is <= the lower bound of this task
        * hashMap[x].push_back(indexOfTask)
        * Repeat 7 until exhaust the task list
    - 8. Go through the hash map. The value (std::vector) of each entry are grouped.         

This solution has O(NlogN) computation complexity and O(N) space complexity.