Pages

05/06/2021

[Java] Find Kth most frequent elements in given set

This is an interesting problem: given a set of words, rank them by their frequency and return the k-th most frequent ones.

If K is bigger than the lowest rank, we return an empty set. Arguably, we could return the least frequent words instead. 

This question quickly becomes more philosophical since a "best" solution is really tied to what real use case you intend to solve:

  • are you going to parse a dataset once and execute multiple queries with different K values on it?
  • are you going to parse different datasets multiple times and execute few queries with different K values on it?
  • are you going to parse datasets so big the data won't reasonably fit in memory?

Here are some approaches.

Option 1: count and sort

We can see that by using extra O(N) space we are able to return a solution after processing the input a bit:

  • in O(N) time we can process the input words and build a map(word, frequency)
  • in O(N) time we can process that map a build another map (frequency, set of words)
  • in O(M log(M)) time we can process this second map to sort it by frequency (using a reverse natural ordering comparator)
  • in O(1) time we can return the desired result

Now this approach would yield O(N + M log(M)) worst case time and uses O(N) extra space.

An upside of this approach is that after this one time processing, we can query for any valid K.

 

Option 2: count and heap

  • in O(N) time process the words and build a map(word, frequency)
  • in O(N) time process that map and build another map(frequency, set of words)
  • in O(M log(K)) time we add all pairs(frequency, set of words) to a max heap of maximum size K using the frequency as the comparator
  • in O(log(K)) time we pop all values from the heap and get our result

Assuming K<<N we would have O(N + M log(K) + K log(K)) which is O(N + M) time and O(N) space.

An upside of this approach is the better theoretical speed but for a different value of K we need to do again O(M) work to recompute the new heap rankings, while with the previous approach we already get our solution in O(1) time.

Last note, if M<<N the theoretical time is O(N) in both cases.

You can check my implementation of option 1 in findKthMostFrequentWords on my Gist along with some tests in FindKthMostFrequentWordsJTests.

No comments:

Post a Comment

With great power comes great responsibility