Given a list of words, find all possible words generated from the word in a performant way

Question

The way I am thinking of this is, say you have the word "palindrome". From that word, and given the dictionary of possible words which I have learned ("my dictionary"), you can generate:

do
in
drone
mind
plan
line
role
lad
pad
drape
...

Now imagine we do this programmatically, given a dictionary in computer form (a list of words in a text file). The question is how to somewhat efficiently find the words that can be generated from a word, for each word. So for every word in the dictionary, we find every word that can be generated from it.

The rules for generating the words from a given word are that you can only use each letter once, but you can pick which letter you want in any order.

The naive way I am starting to try to solve this problem is by first loading all words into a hashtable, an object in JavaScript. Seems like maybe putting them into a trie would be better, but not sure. Then iterate through each word in the hashtable. Try generating all possible combinations of letters of all possible lengths for the given word. I have no idea how to mathematically calculate how many combinations this is, but it seems like a lot. So for "palindrome" that is 10 letters, but seems like a ton of combinations just for 10 letter combinations, not to mention 9, 8, 7, 6....

So wondering how you might go about this more efficiently. I'm sure someone has already found the trick to solve this nicely.

Note, this is not a homework or interview question, I am just curious how to do this efficiently.

In the worst case you need to output the entire dictionary (when the input "word" is something like "aaaaabbbbbccccc...zzzzz"). So you can just check your input word against each word in the ductionary, wiewed as a bag of letters. If you need better complexity, say O(size-of-output) rather than O(size-of-input), you need somewhat more advanced techniques. — n. 1.8e9-where's-my-share m.
The total number of permutations of 10 letters would be 10!, or 3,628,800. Of course, for 11 letters it would be close to 40 million. You definitely don't want to do an exhaustive search . . . — Jim Mischel

Mark Mark · Accepted Answer · 2019-05-19T07:47:05

This might be a nice application of a trie data structure. This allows you to start looking through permutations of the word, while exiting early if there's no chance that word will be in your dictionary. For example, there's a lot of permutations of palindrome that start dp, but you won't consider any of them because dp is a dead end in the trie.

As you are looking through palindrome you might be looking at mode. You'll add that to the list of found words. There are further children in the trie for mode, but none for modep, modei, etc. you you can cut off those entire branches in your search. You will just continue on branch that have children leading on to words like model, modern.

It's fairly easy to turn a list of words into a trie using a dictionary in python:

trie = {}

with open('words.txt') as words:
    for word in map(lambda w: w.strip(), words):
        cur = trie
        for l in word:
            cur  = cur.setdefault(l, {})
        cur['word'] = True # defined if this node indicates a complete word

With a reasonable wordlist the root dictionary will probably have a key for every letter in the alphabet. But it quickly gets smaller as you descend. For example with a small word list looking up trie['w'] will have keys like ['a', 'e', 'h', 'i', 'o', 'r'] representing words in your dictionary starting with wa, we, ...etc. trie['q'] might just have one key u unless you have a dictionary with less common words.

Once you've built your trie, you can iteratively go thought the permutations of the word in question, adding words as you find them. This will go considerable faster than looking at all permutation because you exit the branch early when there's not more letters that match keys in the current branch of the trie.

Given the trie built above and a wordlist of 3000 common words, this quickly finds 100 words by recursively crawling the trie. Counting the number of times the inner for loop runs shows 2670 with a 3000 word dictionary -- not too bad considering there are 3.6 million permutations of the letters in the 'palindrome'. Using the much larger wordlist in /usr/share/dict/words which has roughly 250,000 words required 21543 loops. Looking up 'quilt' only ran 100 times with the large list.

Here's some Python code for traversing the trie for each valid permutation.

def findWords(word, trie = trie, cur = '', words = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words.append(cur + letter)
            findWords(word[:i] + word[i+1:], trie[letter], cur+letter, words )

    return words
words = findWords("palindrome")

Result:

['palm',
 'pale',
 'pain',
 'pair',
 'pan',
 'panel',
 'plan',

  ...

 'media',
 'ear',
 'earn',
 'end',
 'era']

Given a list of words, find all possible words generated from the word in a performant way

2 Answers