Scumm Bar: [Java] Find Kth most frequent elements in given set

05/06/2021

[Java] Find Kth most frequent elements in given set

This is an interesting problem: given a set of words, rank them by their frequency and return the k-th most frequent ones.

If K is bigger than the lowest rank, we return an empty set. Arguably, we could return the least frequent words instead.

This question quickly becomes more philosophical since a "best" solution is really tied to what real use case you intend to solve:

are you going to parse a dataset once and execute multiple queries with different K values on it?
are you going to parse different datasets multiple times and execute few queries with different K values on it?
are you going to parse datasets so big the data won't reasonably fit in memory?

Here are some approaches.

Option 1: count and sort

We can see that by using extra O(N) space we are able to return a solution after processing the input a bit:

in O(N) time we can process the input words and build a map(word, frequency)
in O(N) time we can process that map a build another map (frequency, set of words)
in O(M log(M)) time we can process this second map to sort it by frequency (using a reverse natural ordering comparator)
in O(1) time we can return the desired result

Now this approach would yield O(N + M log(M)) worst case time and uses O(N) extra space.

An upside of this approach is that after this one time processing, we can query for any valid K.

Option 2: count and heap

in O(N) time process the words and build a map(word, frequency)
in O(N) time process that map and build another map(frequency, set of words)
in O(M log(K)) time we add all pairs(frequency, set of words) to a max heap of maximum size K using the frequency as the comparator
in O(log(K)) time we pop all values from the heap and get our result

Assuming K<<N we would have O(N + M log(K) + K log(K)) which is O(N + M) time and O(N) space.

An upside of this approach is the better theoretical speed but for a different value of K we need to do again O(M) work to recompute the new heap rankings, while with the previous approach we already get our solution in O(1) time.

Last note, if M<<N the theoretical time is O(N) in both cases.

You can check my implementation of option 1 in findKthMostFrequentWords on my Gist along with some tests in FindKthMostFrequentWordsJTests.