Collapse list of phrases by removing subsets in Python

Question

I have a series of phrases that occur in a larger text. I would like to emphasize the phrases in that text, but first I want to collapse the list so that no phrase is contained in another. I am using Python 3.5 and NLTK for most of the processing.

For instance, if I have the sentence:

The quick brown fox jumped over the lazy dog

and the phrases

brown fox

quick brown fox

I want the resulting HTML to look like

The <b>quick brown fox</b> jumped over the lazy dog

not

The <b>quick <b>brown fox</b></b> jumped over the lazy dog

It seems like I should be able to craft some sort of list comprehension that removes items that are a subset of other items in the list, but I can't quite wrap my head around it. Any ideas about how to collapse my phrases to remove subsets of other entries?

kevinadi · Accepted Answer · 2016-12-13T06:42:42

If the terms are stored in a list:

terms = ['brown fox', 'quick brown', 'quick brown fox']

I would create a list of subsets by checking the term list against itself and collecting every term that is contained in another term in the list (for strings, x in y is a substring test):

subsets = []
for x in terms:
    for y in terms:
        if x in y and x != y:
            subsets.append(x)

or using a list comprehension:

subsets = [x for x in terms for y in terms if x in y and x != y]

then remove all the known subsets from the list of terms:

phrases = [x for x in terms if x not in subsets]

or in a one-liner (maybe not recommended since it's quite unreadable):

phrases = [z for z in terms if z not in [x for x in terms for y in terms if x in y and x != y]]

This should give you:

>>> print(phrases)
['quick brown fox']
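
One caveat: because x in y on strings is a plain substring test, a term such as 'own fox' would be treated as a subset of 'brown fox'. Below is a minimal sketch of a word-level variant that also wraps the surviving phrases in <b> tags. It assumes phrases are separated by whitespace and that the sentence needs no HTML escaping; the helper names contains_words, collapse_phrases and emphasize are illustrative and not part of the code above.

```python
import re


def contains_words(longer, shorter):
    """True if the words of `shorter` form a contiguous run inside `longer`."""
    a, b = longer.split(), shorter.split()
    return any(a[i:i + len(b)] == b for i in range(len(a) - len(b) + 1))


def collapse_phrases(terms):
    """Drop every phrase that is contained, word for word, in a different phrase."""
    unique = []
    for t in terms:  # remove exact duplicates while keeping the original order
        if t not in unique:
            unique.append(t)
    return [x for x in unique
            if not any(x != y and contains_words(y, x) for y in unique)]


def emphasize(sentence, phrases):
    """Wrap each phrase in <b> tags, matching only at word boundaries."""
    if not phrases:
        return sentence
    # Try longer phrases first so the regex prefers them when alternatives overlap.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b"
    return re.sub(pattern, r"<b>\1</b>", sentence)


terms = ['brown fox', 'quick brown', 'quick brown fox']
phrases = collapse_phrases(terms)
print(phrases)
# ['quick brown fox']
print(emphasize('The quick brown fox jumped over the lazy dog', phrases))
# The <b>quick brown fox</b> jumped over the lazy dog
```

Comparing word lists instead of raw strings costs a little more per pair, but it keeps partial-word matches from removing phrases that should stay in the list.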
