2
votes

I have a series of phrases that occur in a larger text. I would like to emphasize the phrases, but I want to first compact the phrases. I am using Python 3.5 and NLTK for most of the processing.

For instance, if I have the sentence:

The quick brown fox jumped over the lazy dog

and the phrases

brown fox

quick brown fox

I want the resulting HTML to look like

The <b>quick brown fox</b> jumped over the lazy dog

not

The <b>quick <b>brown fox</b></b> jumped over the lazy dog

It seems like I should be able to craft some sort of list comprehension that removes items that are a subset of other items in the list, but I can't quite wrap my head around it. Any ideas about how I can collapse my phrases to remove subsets of other entries?

2 Answers

1
votes

If the terms are listed in a list:

terms = ['brown fox', 'quick brown', 'quick brown fox']

I would create a list of subsets by checking the term list against itself and collecting every term that is contained in another term in the list (note that x in y is a plain substring test on the two strings):

subsets = []
for x in terms:
    for y in terms:
        if x in y and x != y:
            subsets.append(x)

or, using a list comprehension:

subsets = [x for x in terms for y in terms if x in y and x != y]

then remove all the known subsets from the list of terms:

phrases = [x for x in terms if x not in subsets]

or in a one-liner (maybe not recommended since it's quite unreadable):

phrases = [z for z in terms if z not in [x for x in terms for y in terms if x in y and x != y]]

should give you:

>>> print(phrases)
['quick brown fox']
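
One caveat with the substring test above: x in y compares characters, so a term like 'row' would be dropped because it occurs inside 'brown fox'. If that matters for your data, here is a rough sketch that compares whole words instead. The helper names contains_phrase and compact_phrases are made up for this example, and it assumes the terms are separated by plain whitespace:

```python
def contains_phrase(outer, inner):
    """Return True if the words of `inner` appear contiguously in `outer`."""
    outer_words, inner_words = outer.split(), inner.split()
    n = len(inner_words)
    # Slide a window of n words across outer and compare it to inner.
    return any(outer_words[i:i + n] == inner_words
               for i in range(len(outer_words) - n + 1))


def compact_phrases(terms):
    """Keep only the terms that are not contained in another, different term."""
    return [x for x in terms
            if not any(x != y and contains_phrase(y, x) for y in terms)]


terms = ['brown fox', 'quick brown', 'quick brown fox', 'row']
print(compact_phrases(terms))  # ['quick brown fox', 'row']
```

Using any() also avoids building the intermediate subsets list, which contains duplicates when a term is found inside more than one other term.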
1
votes

I think the best approach is to write your own parser for this. That way, rather than removing the extra tags afterwards, we never insert them in the first place. You scan the sentence one character at a time and match the characters against those of your phrases. If there is a match, the tags are inserted at the proper location.

I have also arranged the phrases in descending order of length, so nested tags are avoided automatically: as soon as one phrase matches, the remaining phrases are not checked at that position.

Here is my parser:

# sentence is a string
# phrases is a list of strings
def highlightphrases(sentence, phrases):
    # Try longer phrases first so a shorter phrase never ends up nested
    # inside a longer one. sorted() leaves the caller's list untouched,
    # and empty phrases are skipped because they would match everywhere.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    index = 0
    while index < len(sentence):
        for phrase in ordered:
            if sentence.startswith(phrase, index):
                end = index + len(phrase)
                sentence = sentence[:index] + "<b>" + phrase + "</b>" + sentence[end:]
                # Skip the phrase plus the 7 characters of inserted tags;
                # the remaining 1 is added by the increment below.
                index += len(phrase) + 6
                break
        index += 1
    return sentence
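
For comparison, the same longest-first idea can probably be expressed with the standard re module. This is only a sketch under the same assumptions as the parser above (plain case-sensitive substring matching), and the function name emphasize is made up for this example:

```python
import re


def emphasize(sentence, phrases):
    # Longest phrases first, so the alternation prefers 'quick brown fox'
    # over 'quick brown' when both could match at the same position.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return sentence
    # re.escape keeps punctuation in a phrase from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    # sub() replaces non-overlapping matches from left to right, so text that
    # is already part of a match is never wrapped a second time.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", sentence)


print(emphasize("The quick brown fox jumped over the lazy dog",
                ["brown fox", "quick brown", "quick brown fox"]))
# The <b>quick brown fox</b> jumped over the lazy dog
```

Because the matching is done per position in the sentence, a shorter phrase such as 'brown fox' is still highlighted wherever it appears on its own, without 'quick' in front of it.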

Note: I am not primarily a Python programmer, so please excuse the code if it is rough. Let me know if the answer can be improved in any way; suggested edits are welcome. I am new to Python and still learning :)