This is a solution to a sample “tech interview” question posted in a video in Google career. The problem is
Write a program that breaks up a string of words with no spaces into a string with the appropriate spaces.
If you just want to download the solution, click here.
Personally I have learned a lot from the video, e.g. clarify the question, using libraries (e.g. a dictionary) and APIs, find and solve a simplified version of the problem, etc. The Google people in the video even gave a sample solution involving two words. It seems easy to use recursion to recognize three or four words.
But there is one missing piece… a real solution of the problem. The “method” was only about how to split into two words or three words. It is not scalable to large strings because that algorithm has complexity O(nk), where k is the number of words in the string.
Solving the problem
First of all, the problem can be into two tasks:
- Finding the dictionary word matches in the string.
- Put the word matches together into a solution.
Finding the dictionary word matches in the string
For this task, we have n possible starting positions in the string. If we take all the possible end positions, then we need to match with complexity O(n2). It is still slow, but much better than the method before.
Wait… Can the dictionary arranged in some way that matching can be more efficient? The answer is yes!
Store the dictionary into a trie. Then all end positions can be matched with a single lookup. For example, if “chairmanzzzz” is matched against a trie, both “chair” and “chairman” can be return at the same time.
Since each dictionary word only consist of a few letters (i.e. averages to a small constant), the matching can be done in constant time no matter how long the given string is. It means the matching has complexity O(n) now.
(Note: There is a powerful Aho-Corasick algorithm that uses a modified trie to return all dictionary matches with one single lookup. Since I just want to test the approach, I decided not to use Aho-Corasick because it adds complexity.)
Put the word matches together into a solution
Now I have all the dictionary matches. How do we find a solution?
Now view the match “starting from position 3 with a length of 4 letters” as “a directed edge from 3 to 7”. With directed edges we have an obvious solution: Dijkstra algorithm.
(Bonus question: what can be stored in the weights of the edges? Is it to possible take an advantage of these weights?)
Here are the steps of Dijkstra algorithm:
- Assign node 0 with tentative distance 0. All nodes are unvisited.
- Find a unvisited node k, which has a minimum tentative distance (from node 0).
- Find the new tentative distance of all unvisited nodes connected to node k.
- Mark node k visited.
- Repeat steps (2) and (4) until there are no unvisited nodes with a tentative distance.
In this problem, step 2 can be replaced by “find a minimum value k that node (position) k is a unvisited node with a tentative distance.” The proof of the validity of this modification is left to readers.
The solution was coded using C++. A few points are noted here:
All letters are converted into lower case, and all non-letters are removed. Spaces in the user input are also removed for easy testing.
A word list from SCOWL is used.
For Dijkstra algorithm, a vector of matches is stored instead of just the best one. This creates the possibility of producing multiple results later.
What to do if the end cannot be reached from the start? Just go as far as we can, output that part of the result, and do Dijkstra algorithm again starting from the next character.
A high-resolution timer written by Song Ho Ahn (2003) is used for timing.
To make testing results easy to read, the given string is splitted using digits as the boundaries. The blocks of digits are simply outputted as single words.
Testing and modifications
Testing is done against simple phrases and passages in the Wikipedia. As expected, my first try does yield a result quickly.
But the splitting result is not that good. The string iamahero turns out to be i a ma hero instead of i am a hero. (Note: “ma” is a dictionary word.)
Solution: Put frequently used words a higher priority. The SCOWL dictionary comes with different numerical sizes! What a coincidence! So these sizes can be simply used as weights in Dijkstra algorithm for better results.
Now there is one other problem: eyesonme becomes eye son me instead of eyes on me.
Solution: The “s” looks better to be put in the first word (as plurals) instead of the second word. So the weights of all words that end with “s” are subtracted by 2.
The revised version returns better results. So we try splitting a whole passage from Wikipedia.
The timing data in the output was produced in a ZOTAC ZBOX AD-02 with an AMD E-350 CPU.
** Input data (spaces are removed) **
** Output **
English words are almost broken up perfectly. Only a few artifact remains: “Ian”, “Herman” (names) and some latin words that is not stored in the dictionary.
Only a few milliseconds is used for splitting, and that already includes I/O time. (cin and cout are very inefficient.)
It means that the given problem is practically solved. Of course improvements can be made, including the following:
Separate I/O from algorithm execution for better timing.
Improve I/O performance.
Create a GUI program. (Use wxWidgets.)
Tweaking the weights of words. (How to do this automatically? User votes?)
Handling of diacitrics. (Currently letters with diacitrics are skipped. They should be converted to the corresponding letter without diacitrics instead.)
Text encoding (for diacitrics)
Add punctuations and capital letters. (This is even harder than this problem.)
Download the solution
The solution can be downloaded here.