It’s college recruiting season once again, and tech companies have started interviewing for full-time and internship positions. Around this time last year I had many interviews for a summer internship. I eventually interned at Microsoft Bing, and I will be joining the team full time next summer. I won’t have any interviews this year, but since most of my friends are actively preparing for theirs, I thought it would be useful to share some good-quality interview questions along with my solutions. One question I have come across often recently: given an integer array, output all pairs that sum to a specific value k.

Let’s say the array is of size N. The naive way to solve the problem is to check, for each element, whether k-element is present in the array, which takes O(N^2) time. This is of course far from optimal, and you might not even want to mention it during an interview. A more efficient solution is to sort the array and use two pointers that scan it from the beginning and the end at the same time. If the sum of the values at the left and right pointers equals k, we output the pair. If the sum is less than k we advance the left pointer, and if it is greater than k we decrement the right pointer, until the two pointers meet. The complexity of this solution is O(NlogN) due to sorting. Here is the Python code:

    def pairSum1(arr, k):
        if len(arr) < 2:
            return
        arr.sort()  # sorts the caller's list in place
        left, right = (0, len(arr) - 1)
        while left < right:
            currentSum = arr[left] + arr[right]
            if currentSum == k:
                print(arr[left], arr[right])
                left += 1  # or right -= 1
            elif currentSum < k:
                left += 1
            else:
                right -= 1
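For comparison, the naive quadratic approach mentioned earlier could be sketched roughly as follows. This is only an illustration and not part of the original solutions: the name `pair_sum_naive` is hypothetical, the code assumes Python 3, and it returns the pairs as a list instead of printing them.

```python
def pair_sum_naive(arr, k):
    """Return every pair (arr[i], arr[j]) with i < j whose sum is k.

    Hypothetical helper for illustration only; runs in O(N^2) time.
    """
    pairs = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] + arr[j] == k:
                pairs.append((arr[i], arr[j]))
    return pairs


print(pair_sum_naive([1, 3, 2, 4], 5))  # [(1, 4), (3, 2)]
```

Unlike the two-pointer version, this sketch leaves the input array untouched, at the cost of comparing every pair of positions.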

Most array-based interview questions can be solved in O(NlogN) once we sort the input array. However, interviewers generally expect a linear-time solution, so let’s find an O(N) one. But first we should clarify a detail with the interviewer: if the same pair occurs more than once, do we output it twice? For example, suppose the array is [1, 1, 2, 3, 4] and the desired sum is 4. Should we output the pair (1, 3) twice or just once? And do we also output the reverse of a pair, meaning both (3, 1) and (1, 3)? Let’s keep the output as short as possible and print each pair only once, so we will output a single copy of (1, 3). Also note that we shouldn’t output (2, 2), because the array contains only one 2 and a pair needs two distinct elements.

The O(N) algorithm uses the set data structure. We perform a linear pass from the beginning, and for each element we check whether k-element is in the set of seen numbers. If it is, we have found a pair that sums to k and add it to the output. If not, this element doesn’t belong to a pair yet, and we add it to the set of seen elements. The algorithm is really simple once we think of using a set. The complexity is O(N) because we do a single linear scan of the array, and for each element we either check whether its complement is in the set or add the current element to the set. Insert and find operations on a set are both O(1) on average, so the algorithm is O(N) in total. Here is the code in full detail:

    def pairSum2(arr, k):
        if len(arr) < 2:
            return
        seen = set()
        output = set()
        for num in arr:
            target = k - num
            if target not in seen:
                seen.add(num)
            else:
                # store each pair as (smaller, larger) so reversed duplicates collapse
                output.add((min(num, target), max(num, target)))
        print('\n'.join(map(str, output)))
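To see how the set-based approach treats the duplicate example discussed earlier, here is a small sketch that returns the pairs instead of printing them. It is an illustrative variant rather than the code above: the name `pair_sum_set` is hypothetical and Python 3 is assumed.

```python
def pair_sum_set(arr, k):
    """Return the set of unique (smaller, larger) pairs that sum to k.

    Hypothetical variant for illustration; same logic as the set-based scan.
    """
    seen = set()
    output = set()
    for num in arr:
        target = k - num
        if target in seen:
            # Normalise the order so (3, 1) and (1, 3) count as one pair.
            output.add((min(num, target), max(num, target)))
        else:
            seen.add(num)
    return output


print(pair_sum_set([1, 1, 2, 3, 4], 4))  # {(1, 3)}
```

The pair (1, 3) appears once even though there are two 1s, and (2, 2) is absent because the single 2 never finds a second 2 in the set of seen numbers.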

I came up with the hash table solution in the first place. The complexity is O(n+m), where n is the input size and m is the number of hash table keys, which is linear.

Keep your work going. Nice start.

Arden,

A great solution to a problem that’s seen on many interview routes! Well done! Again, I appreciate the way you present the least optimal solutions first and slowly lead towards the one that’s optimal. This is a great interview strategy too. Very nice!

@George I think complexity is

`O(n * n/m)`

where m is the number of keys and n is the number of elements. Assume m=1; now you have n^2, right?

@Arden, I think the Python “set” is not always O(1) on find and insert, as documented here: http://wiki.python.org/moin/TimeComplexity If you could somehow instantiate the set with a specified number of keys, then you could choose m=n and achieve worst case O(1).

However if you just instantiate it as

`set()`

and do not avoid duplicate pairs and assume (x,y)!=(y,x), then the underlying Python “set” implementation needs to do bucketing, which can lead to O(n) for find in the worst case, as documented. I think it is currently not possible to specify the number of keys in the hash table underlying the set implementation. Python can be problematic; however, Java also maintains its hash table with a “load factor”. We should definitely implement our own hash table… :) What do you think?
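The worst case described in this comment can be reproduced on purpose. The sketch below is an illustration and not something from the thread: `CollidingKey` and `comparisons_for_failed_lookup` are hypothetical names, and the class deliberately returns the same hash for every instance so that a membership test has to compare against many stored keys.

```python
class CollidingKey:
    """Hypothetical key type whose instances all share one hash value."""

    eq_calls = 0  # counts equality comparisons triggered by set operations

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0  # deliberately force every key to collide

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        return self.value == other.value


def comparisons_for_failed_lookup(n):
    """Count equality checks used to look up a missing key among n keys."""
    keys = {CollidingKey(i) for i in range(n)}
    missing = CollidingKey(-1)
    CollidingKey.eq_calls = 0  # ignore the comparisons made while building
    _ = missing in keys
    return CollidingKey.eq_calls


for n in (10, 100, 1000):
    print(n, comparisons_for_failed_lookup(n))
```

With ordinary integers the hashes are well spread, so this degradation is not something the pair-sum code would normally hit; the sketch only shows why the documented worst case is linear.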

You’re totally right, Ahmet. As the load factor of the set increases, the worst-case complexity of a single operation becomes linear. But I would assume that after a certain load factor Python resizes the set by doubling its size. So the average time for an operation would still be amortized O(1), although for some elements it can be O(N) in the worst case, as you said. However, during an interview I suppose it’s safe to assume O(1) for operations on sets and hashtables.

Implementing our own hashtable is a great idea. In my web search course last semester, I remember searching for a hashtable implementation that accepts a size hint during construction, so that it would perform fewer resize operations, because I already knew that I would insert millions of elements while implementing a search engine. I think the default size of a dictionary in Python is 8, and the load factor threshold for resizing is 2/3. The size is multiplied by 4 during resizing, unless the hashtable is already big (50,000 entries), in which case the size is doubled.
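One rough way to watch these resizes happen is to track `sys.getsizeof` while inserting elements. This is only an observational sketch using a set instead of a dictionary: `count_resizes` is a hypothetical helper, and the exact sizes and thresholds are CPython implementation details that differ between versions, so no particular output is assumed.

```python
import sys


def count_resizes(n):
    """Insert n integers into a set and record each change in its size.

    Hypothetical helper; a change in sys.getsizeof is taken as a sign
    that the underlying table was reallocated.
    """
    s = set()
    last_size = sys.getsizeof(s)
    resize_points = []  # (number of elements, new size in bytes)
    for i in range(n):
        s.add(i)
        size = sys.getsizeof(s)
        if size != last_size:
            resize_points.append((len(s), size))
            last_size = size
    return resize_points


points = count_resizes(100_000)
print(len(points), "size changes for 100000 insertions")
for count, size in points:
    print(count, size)
```

The number of size changes should be tiny compared to the number of insertions, which is the intuition behind treating each insertion as amortized O(1).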

@Arden

If I am using the C++ STL set,

1. Insert takes logarithmic time, but it is amortized constant when given a correct position hint.

2. Find takes logarithmic time.

So I think it would be O(nlogn) rather than O(n).

The worst-case complexity of find in a hash-based set can be as bad as O(N), as Ahmet mentioned above. But I think it’s safe to use the average-case constant complexity for sets and hashtables during an interview, as long as you mention the worst-case behavior. To be technically precise, I should describe O(N) as the average-case bound rather than a worst-case guarantee, but these articles are intended to focus more on common interview practices. And you’re right, the worst-case complexity using the C++ STL set is O(NlogN). But I don’t think interviewers will object to O(N) as long as you mention the worst case; that’s my experience at least.