Scumm Bar: [Java] Reconstruct sentence from text

21/07/2021

[Java] Reconstruct sentence from text - word break

Given a string representing a text and a dictionary of words, return a possible reconstruction of a sentence built by breaking off the given text using whitespaces to separate valid words.

Example:

text = "bedbath", words = {bed, bath} return "bed bath"

text = "bedbath", words = {bed} return ""

My initial idea used a Aho-Corasik trie which I though would give a linear time solution by greedily matching longest prefixes, then falling back to shorter ones while walking on the string.

This is partially true unfortunately as it works when prefixes are contained within each other, however it fails when the last character of a word is the start character of another word, and we have a fail link between them.

In that case, we would incorrectly assume we have matched two words (technically true) but in reality we matched two substrings. The shared character cannot be used for both words.

Example: text = "catsand", words = {cats, sand} return ""

If we match until "catS", then we can't continue, however we can follow the fail link to root and continue on the "S" edge for "Sand". We mistakenly assume that "S" appeared twice.

So it's fallback to DP, where we apply a similar approach to the trie matching BUT we break off any valid word we encounter and continue matching from the NEXT character onward.

Therefore for each character in the string, we have two choices: break off a word here (if the current suffix forms a dictionary word) or expand the current suffix and break later.

This translates to a O(2^N) time solution. We can however use O(N) cache space to bring the cost down to O(N^2) if we cache the indexes from where we have found in the past that no valid break can be made. This means we explore all pairs of prefix-postfix once.

If we want to reconstruct the sentence and not just return a boolean saying whether the break can be done, we can use StringBuilder objects to track the matched words up to a certain point, discarding their content if we reach a failure node or appending if we can progress with the matching.

In the end, we will have the text that is broken down to the shortest words possible to make it.

Example:

string = aaab
words = a, aa, aaa

recursion tree looks like:

aaab
a,0 aa,1 aaa,2

a,1 aa,2 a,2 b,3

a,2 b,3 b,3

b,3

Where we start from the whole string and for each matched suffix (a dictionary word) we break off at the current index and explore from there.

As you can see we have many repeated calculations, which we can cache.

You can check my implementation of reconstructSentence on my Gist along with some tests in ReconstructSentenceJTests.