Friday, 19 June 2015

Data Structure and Algorithm - Find the Maximal Non-Overlapping Subset (Microsoft Interview Question)

1. Problem Description
This is a Microsoft interview question for interns from careercup. Here is the original thread,

Find a maximum subset of S such that no pair of intervals in S' overlaps ?"

2. Analysis
Given a set S = {(starti, endi)} where i is from 1 to n, the task is to find the largest subset whose elements are not overlapped with each other. For instance {(1, 3), (2, 5), (4, 6), (7, 8)}, the largest subset is {(1, 3), (4, 6), (7, 8)}. As the original problem description is not very clear about the two ends. In this case in this blog entry assume that the interval has open two ends - exclusive for lower bound and upper bound. 

what does the result, S', look like?
All the elements in S' should not overlapp. Assuming that the size of S' as m, we sort the lower bounds and upper bounds as
    - {LB1, LB2, LB3, .... LBm} and
    - {UB1, UB2, UB3 .... UBm}
The following property should hold:
    - LB1 <  LB2 < LB3 ... < LBm
    - UB1 <  UB2 < UB3 ... < UBm
    - LB1 < UB1 <= LB2 < UB2 <= LB3 ... <= LBm < UBm
Here UBi <= LBi+1, it allows "=" because two ends are excluded. For instance (1, 3) and (3, 4) are regarded as not overlapped.

At the same time S' has to be one of subsets that have the largest elements. For instance {(1, 4), (3, 5), (7, 8)},  there are two solutions: {(1, 4), (7, 8)} or {(3, 5), (7, 8)}. S' has to be one of these solutions.
How to find (one of) the largest solution(s), we will discus later in this section.

How to detect the overlap between intervals?

The basic idea is to sort the lower bounds and upper bounds, find out how many the interval's lower bound and upper bound sweep through, and calculate the difference - which represents how many this interval overlaps with other intervals.

When find out the number of overlap of each interval, the sum of the number of overlap is always an even number. Assume the first interval overlaps with other intervals as OV1, represent the rest together as {OV1, OV2, ..., OVn}, 
    - sigma(OVi) is an even number, where i is from 1 to n,
Each overlap is counted once by two intervals.It is symmetric. For instance {(1, 4), (2, 3), these two intervals are overlapped. Each interval has an overlap number of 1 and sum them together as 2.

Now let's take another look at S' - the properties it holds. All its elements do not overlap with each other. Therefore the following statement holds
    - sigma(OVi) = 0, where i is from 1 to m and OVi >= 0.

How to guarantee S' the largest?
Let's take a look where we start and where we finish. Start with S with sigma(OVi) >= 0 (even number), where i's range is [1, n] and end with S' with sigma(OVj) = 0, where j's range is [1, m]. The objective function is quite simple max(m), It is equivalent to find the largest subset S' whose elements do not overlap.

Redefine the object function as 
    - min(n - m), where n is a const number given to the size of S
Rather to find the "m", the idea is to remove the least elements/intervals in S to reach S' and use steepest descent algorithm as guider to remove intervals from S.

How does it work with steepest descent?  
Now we know how to use sweep line algorithm to find the number of overlap for each intervals as OV1, OV2, ..., OVn for S and sigma(OVi) as SGn. Start with the interval with max(OVi),  as MAXOVn and remove it from S. Then the left set as Sn-1 have total overlap as SGn - 2*MAXOVn. Then find the number of overlap for Sn-1 and remove the intervals with maximal the number of overlap. Repeat this process until the sum of overlaps reach 0 (reach S' with sigma(OVj) = 0).
    - The cost function of steepest descent: the number of overlap of interval
    - The step of steepest descent: 1 - each time remove one interval

3. Solution
pseudo-code
1. Sort the lower bounds and upper bounds as LB and UB
2. Use sweep line algorithm to detect the number of overlap for each interval
    2.1 Track the interval with largest number of overlap during this process
    2.2 Track the sum of overlap
    2.3 For the interval with 0 overlap
        2.3.1 Remove interval's lower/upper bound from LB and UB
        2.3.2 Move the interval to S'
        2.3.3 Remove it from S 
3. Check sum of overlap
    3.1 Go to Step 4 if it is more than 0
    3.2 Go to Step 6 if it is equal to 0
4. steepest descent: remove the interval with the largest number of overlap from S
5. Remove its lower/upper bound from LB and UB and go to Step 2
6. S' is the result and terminate the program

Solution complexity
Sorting lower bound and upper bound
    - O(NlogN)
Sweep line algorithm to find the number of interval
    - O(NlogN)
    - Should be less than this because interval with 0 overlap removed from LB and UB
Repeat sweep line algorithm for N-M times as
    - N*logN + (N-1)*log(N-1) + ... + (N-M)*log(N-M)
    - Should be less than this because interval with 0 overlap removed from LB and UB
Space complexity
    - O(N)

Example

Example: find the maximal non-overlapping subset

No comments:

Post a Comment