21/11/2017

[Java] Merge sorted files

Here is a problem that takes more time to code than to figure out: given N files where each line contains a timestamp (as long) and some data sorted in ascending order, merge them in a single file also sorted in ascending order.
The files can be arbitrarily big and are passed in input as a full path list; each has content in the format:

timestamp#data

Since the files can be arbitrarily big, we might want to avoid loading them all into memory at once; we can instead keep an open pointer to each of them using O(N) additional space.

This is not enough though, since we also need to write entries to the output file in ascending order too, which would be a big issue if the input files were themselves not ordered already. As long as they are already ordered, we can avoid using a mergesort modification for example and rely on a min heap (aka PriorityQueue) instead.

We will read one line from each file and place it in the min heap which will be in size of O(N) as well and will have a O(log N) handling complexity. To understand later on which entry corresponds to which file and pointer, we use a custom structure FileEntry as object for the heap, meaning we override equals, hashCode and compareTo and we also use a HashMap from filename to BufferedReader so that we can quickly read the next line from each input file once we determine from which one we need to read next.

We keep looping over the heap until it's empty, meaning we read all the lines and also wrote them out. For each element we remove from the heap, we add, if any, a new element (the next line to read) from the same source file; if the file is finished, we remove the reader from the map.

Total runtime is O(N log N) and total space is O(N).

The hardest part here is actually the handling of file IOExceptions, as we might have many BufferedReaders active at the same time and need to gracefully try to close all of them if we encounter an error during processing. The algorithm itself is quite short, but there is a lot of file handling code around it unfortunately; also some code was added to have a more graceful error handling, but it is not strictly necessary.

You can check the implementation of mergeSortedFiles on my Gist along with some test cases (STUB!) in MergeFilesJTests. This time I was too lazy to properly create JUnit tests for these File operations, so please check out the source and edit it as needed, also do NOT run ALL tests together: you will get a "file already exists" exception since the out file is created but never automatically deleted.

No comments:

Post a Comment

With great power comes great responsibility