This is an interesting problem: given a set of words, rank them by their frequency and return the k-th most frequent ones.
If K is bigger than the lowest rank, we return an empty set. Arguably, we could return the least frequent words instead.
This question quickly becomes more philosophical since a "best" solution is really tied to what real use case you intend to solve:
- are you going to parse a dataset once and execute multiple queries with different K values on it?
- are you going to parse different datasets multiple times and execute few queries with different K values on it?
- are you going to parse datasets so big the data won't reasonably fit in memory?
Here are some approaches.
Option 1: count and sort
We can see that by using extra O(N) space we are able to return a solution after processing the input a bit:
- in O(N) time we can process the input words and build a map(word, frequency)
- in O(N) time we can process that map a build another map (frequency, set of words)
- in O(M log(M)) time we can process this second map to sort it by frequency (using a reverse natural ordering comparator)
- in O(1) time we can return the desired result
Now this approach would yield O(N + M log(M)) worst case time and uses O(N) extra space.
An upside of this approach is that after this one time processing, we can query for any valid K.
Option 2: count and heap
- in O(N) time process the words and build a map(word, frequency)
- in O(N) time process that map and build another map(frequency, set of words)
- in O(M log(K)) time we add all pairs(frequency, set of words) to a max heap of maximum size K using the frequency as the comparator
- in O(log(K)) time we pop all values from the heap and get our result
Assuming K<<N we would have O(N + M log(K) + K log(K)) which is O(N + M) time and O(N) space.
An upside of this approach is the better theoretical speed but for a different value of K we need to do again O(M) work to recompute the new heap rankings, while with the previous approach we already get our solution in O(1) time.
Last note, if M<<N the theoretical time is O(N) in both cases.
You can check my implementation of option 1 in findKthMostFrequentWords on my Gist along with some tests in FindKthMostFrequentWordsJTests.
No comments:
Post a Comment
With great power comes great responsibility