Anagram generation - Isn't it a kind of subset sum?

3

votes

Anagram:

An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once.

Subset Sum problem:

The problem is this: given a set of integers, is there a non-empty subset whose sum is zero?

For example, given the set { −7, −3, −2, 5, 8}, the answer is yes because the subset { −3, −2, 5} sums to zero. The problem is NP-complete.
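To make the definition concrete, here is a minimal brute-force sketch in Python. The function name `zero_sum_subset` is made up for illustration. It simply tries every non-empty subset, so it takes exponential time and is not meant as a practical solver.

```python
from itertools import combinations

def zero_sum_subset(numbers):
    """Return a non-empty subset summing to zero, or None if none exists."""
    items = list(numbers)
    # Try every subset size from 1 to len(items): 2^n - 1 candidates in total.
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            if sum(subset) == 0:
                return subset
    return None

print(zero_sum_subset([-7, -3, -2, 5, 8]))  # (-3, -2, 5)
```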

Now say we have a dictionary of n words. The anagram generation problem can then be stated as: find a set of words in the dictionary (of n words) that together use up all the letters of the input. So doesn't it become a kind of subset sum problem?

Am I wrong?

algorithm language-agnostic subset-sum

Would you please mark an accepted answer? – Raymond Hettinger

4

votes

The two problems are similar but are not isomorphic.

In an anagram the order of the letters matters. In a subset sum, the order does not matter.
In an anagram, all the letters must be used. In a subset sum, any subset will do.
In an anagram, the subgroups must form words taken from a comparatively small dictionary of allowable words (the dictionary). In a subset sum, the groups are unrestricted (no dictionary of allowable groupings).

3

votes

If you could prove that solving anagram finding (no more than a polynomial number of times) solves the subset sum problem, it would be a revolution in computer science (you'd have proved P=NP).

Clearly, finding anagrams is a polynomial-time problem:

Checking whether two records are anagrams of each other is as simple as sorting their letters and comparing the resulting strings (that takes C*s*log(s) time, where s is the number of letters in a record). You'll need at most n such checks, where n is the number of records in the dictionary. So the running time, ~ C*s*log(s)*n, is bounded by a polynomial in the input size: your input record and the dictionary combined.
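The sort-and-compare check described above could look roughly like this in Python. It is only a sketch: the names `letter_key` and `find_anagrams` and the sample dictionary are made up for illustration.

```python
def letter_key(text):
    """Canonical form of a record: its letters in sorted order (O(s log s))."""
    return "".join(sorted(text))

def find_anagrams(phrase, dictionary):
    """Return every dictionary entry that is an anagram of `phrase`."""
    key = letter_key(phrase)
    # One sort-and-compare per entry, so at most n checks.
    return [entry for entry in dictionary if letter_key(entry) == key]

print(find_anagrams("listen", ["silent", "enlist", "google", "tinsel"]))
# ['silent', 'enlist', 'tinsel']
```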

EDIT:

All of the above is valid only if the anagram finding problem is defined as finding an anagram of the input phrase in a dictionary of possible complete phrases.

While the wording of the anagram finding problem in the original question above...

Now say we have a dictionary of n words. The anagram generation problem can then be stated as: find a set of words in the dictionary (of n words) that together use up all the letters of the input.

...seems to imply something different, e.g. the possibility that some composition of more than one dictionary entry is also a valid anagram of the input.

This, however, seems immediately problematic and unclear because (1) a phrase is usually not just a sequence of random words (it should make sense as a whole phrase), and (2) words in a phrase usually require separators, which are also symbols. So it is not clear whether separators (whitespace characters) are required in the input to allow separate dictionary entries, or whether separators are allowed within a single dictionary entry.

So in my initial answer above I applied a "semantic razor", interpreting the problem definition in the only way that is unambiguous and makes sense as "anagram finding".

But we might also interpret the author's definition like this:

Given a dictionary of n letter sequences (separate dictionary entries may contain the same sequence) and one target letter sequence, find any subset of the dictionary entries that, concatenated together, would be an exact rearrangement of the target letter sequence, OR determine that no such subset exists.

Even though the problem stated this way no longer really makes sense as an "anagram finding problem", it is still interesting. It is a very different problem from the one I considered above.
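As an illustration of this second interpretation, here is a rough backtracking sketch in Python. The name `find_cover` and the sample words are made up, and the input is assumed to contain letters only (no separators). In the worst case it explores all 2^n subsets of the dictionary, so it says nothing about polynomial-time solvability.

```python
from collections import Counter

def find_cover(target, entries):
    """Return a list of entries whose letters exactly use up `target`, or None."""
    counts = [Counter(e) for e in entries]

    def search(index, remaining):
        if not remaining:            # every letter of the target is used up
            return []
        if index == len(entries):    # out of entries, letters still left over
            return None
        entry = counts[index]
        # Option 1: take this entry if its letters fit into what remains.
        if all(remaining[ch] >= n for ch, n in entry.items()):
            rest = search(index + 1, remaining - entry)
            if rest is not None:
                return [entries[index]] + rest
        # Option 2: skip this entry.
        return search(index + 1, remaining)

    return search(0, Counter(target))

print(find_cover("dormitory", ["dirty", "cat", "room"]))  # ['dirty', 'room']
```

Each dictionary entry is used at most once here, matching the "subset of the dictionary entries" wording; duplicate entries in the list are treated as separate items.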

One more thing remains unclear: the flexibility of the alphabet. To be specific, the problem definition must also state whether the set of letters is fixed OR whether it may be redefined for each new instance of the problem when specifying the dictionary and the target sequence of those letters. That's important, because capabilities and complexity depend on it.

The variant of this problem in which the alphabet (the number of available letters) can be defined for each instance individually is actually equivalent to the subset sum problem. That makes it NP-complete.

I can prove the equivalence of our problem to a natural-number variant of the subset sum problem, defined as:

Given a collection (multiset) of natural numbers (repeated numbers allowed) and a target natural number, find any sub-collection that sums exactly to the target number OR determine that no such sub-collection exists.

It is not hard to see that a mostly linear number of steps is enough to translate the input of one problem into the input of the other, and vice versa. So a solution of one problem translates to exactly one solution of the other problem, plus mostly linear overhead.

This positive-only variant of subset sum is equivalent to the zero-sum variant of subset sum given by the author (see e.g. the Subset Sum Wikipedia article).

2

votes

I think you are wrong.

Anagram Generation must be simpler than Subset Sum, because I can devise a trivial O(n) algorithm to solve it (as defined):

initialize the list of anagrams to an empty list
iterate the dictionary word by word
    if all the input letters are used in the ith word
        add the word to the list of anagrams

return the list of anagrams

Also, anagrams consist of valid words that are permutations (i.e. rearrangements) of the input word, whereas subsets have no concept of order. A subset may also include fewer elements than the input set (hence "subset"), but an anagram must always be the same length as the input word.

1

votes

It isn't NP-complete because, given a single set of letters, the set of anagrams always remains the same.

There is always a single mapping that transforms the letters of the input L to a set of anagrams A, so we can say that f(L) = A for any execution of f. I believe, if I understand correctly, that this makes the function deterministic. The order of a set is irrelevant, so treating a differently ordered solution as non-deterministic is invalid. It is also invalid because all entries in a dictionary are unique and can therefore be deterministically ordered.
