Count frequency of multi-word terms in large texts with Python

Question

I have a dictionary with close to a million multi-word terms (terms containing spaces). This looks something like

[..., 
'multilayer ceramic', 
'multilayer ceramic capacitor', 
'multilayer optical disk', 
'multilayer perceptron', 
...]

I would like to count their frequency in many gigabytes of texts.

As a small example consider counting these four multi-word expressions in a Wikipedia page:

import requests

payload = {'action': 'query', 'titles': 'Ceramic_capacitor', 'explaintext':1, 'prop':'extracts', 'format': 'json'}
r = requests.get('https://en.wikipedia.org/w/api.php', params=payload)
sampletext = r.json()['query']['pages']['9221221']['extract'].lower()
sampledict = ['multilayer ceramic', 'multilayer ceramic capacitor', 'multilayer optical disk', 'multilayer perceptron']

termfreqdic = {}
for term in sampledict:
    termfreqdic[term] = sampletext.count(term)
print(termfreqdic)

This gives something like {'multilayer ceramic': 7, 'multilayer ceramic capacitor': 2, 'multilayer optical disk': 0, 'multilayer perceptron': 0}, but every count() call scans the whole text again, so it will not scale to a dictionary with a million entries.

I've tried with very large regular expressions:

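import re

with open('termlistfile.txt') as fh:
    termlist = [re.escape(w) for w in fh.read().strip().split('\n')]
# Wrap the alternation in one group so \b applies to both ends of every term
termregex = re.compile(r'\b(?:' + '|'.join(termlist) + r')\b', re.I)
termfreqdic = {}
with open(f) as infile:  # f: path to the text file to scan
    for li in infile:
        for m in termregex.finditer(li):
            termfreqdic[m.group(0)] = termfreqdic.get(m.group(0), 0) + 1
with open('counted.tsv', 'w') as out:
    out.write('\n'.join(a + '\t' + str(v) for a, v in termfreqdic.items()))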

This is dead slow (6 minutes for 1000 lines of text on a recent i7). But if I use regex instead of re by replacing the first two lines, it goes down to around 12s per 1000 lines of text, which is still very slow for my needs:

import regex

termlist = open('termlistfile.txt').read().strip().split('\n')
termregex = regex.compile(r"\L<options>", options=termlist)
...

Note that this does not do exactly what I want, because one term may be a subterm of another, as with 'multilayer ceramic' and 'multilayer ceramic capacitor'. This also rules out approaches that tokenize first, as in "Find multi-word terms in a tokenized text in Python".
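To make the subterm problem concrete, here is a small sketch using only the standard re module. The sample sentence is made up for illustration. It shows that a single alternation reports non-overlapping matches only, so the longer term is never counted, whereas counting each term on its own finds both.

```python
import re

text = "a multilayer ceramic capacitor and a multilayer ceramic disc"
terms = ["multilayer ceramic", "multilayer ceramic capacitor"]

# One big alternation: alternatives are tried left to right and matches
# never overlap, so the shorter term wins at every position.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")
alternation_hits = [m.group(0) for m in pattern.finditer(text)]

# Counting each term separately reports the term and its subterm,
# at the cost of one scan of the text per term.
separate_counts = {
    t: len(re.findall(r"\b" + re.escape(t) + r"\b", text)) for t in terms
}

print(alternation_hits)
print(separate_counts)
```

Sorting the alternatives longest first would flip the problem rather than solve it: the longer term would be found, but the shorter one contained in it would be skipped.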

This looks like a common problem of sequence matching, in text corpora or also in genetic strings, that must have well-known solutions. Maybe it can be solved with some trie of words (I don't mind the initial compilation of the term list to be slow)? Alas, I don't seem to be looking for the right terms. Maybe someone can point me in the right direction?
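The trie-of-words idea mentioned above could look roughly like the sketch below. It is a minimal illustration and not a tuned implementation: it assumes plain whitespace tokenization, and the names build_trie, count_terms and END are hypothetical. Each term is stored as a path of words in nested dicts. The text is then walked once, and from each token position the trie is followed as far as it matches, so a term and its longer extensions are both counted.

```python
from collections import Counter

END = None  # marker key holding the full term; real tokens are never None

def build_trie(terms):
    """Store each multi-word term as a path of words in nested dicts."""
    root = {}
    for term in terms:
        node = root
        for word in term.split():
            node = node.setdefault(word, {})
        node[END] = term
    return root

def count_terms(tokens, trie):
    """Count every term that starts at any token position, subterms included."""
    counts = Counter()
    n = len(tokens)
    for start in range(n):
        node = trie
        i = start
        while i < n:
            node = node.get(tokens[i])
            if node is None:
                break  # no term continues with this word
            if END in node:
                counts[node[END]] += 1
            i += 1
    return counts

sampledict = ['multilayer ceramic', 'multilayer ceramic capacitor',
              'multilayer optical disk', 'multilayer perceptron']
tokens = "a multilayer ceramic capacitor is a multilayer ceramic device".split()
trie_counts = count_terms(tokens, build_trie(sampledict))
print(trie_counts)
```

The work per token is bounded by the number of words in the longest term, so the running time depends on the text length and not on the number of terms. For character-level matching, the Aho-Corasick algorithm is the usual name to search for.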

Can you share enough data to make a minimal reproducible example? — AMC
I've added an example with a small sample text and sample dictionary — kmgrds

SidharthMacherla · Accepted Answer · 2020-04-16T02:10:44

The NLTK approach given below is considerably faster. The author was not able to reproduce the same sampledict, so for this exercise it was created by splitting sampletext on spaces, which means the entries counted here are single tokens rather than multi-word terms. Note: in the timings below, the approach given by the questioner takes roughly 60 times longer (about 0.427 s versus 0.007 s).

Source data:

#Invoke libraries
import nltk
import requests
import timeit
import pandas as pd

#Source sample data
payload = {'action': 'query', 'titles': 'Ceramic_capacitor', 'explaintext':1, 'prop':'extracts', 'format': 'json'}
r = requests.get('https://en.wikipedia.org/w/api.php', params=payload)
sampletext = r.json()['query']['pages']['9221221']['extract'].lower()
sampledict = sampletext.split(' ')

Time the old approach:

start = timeit.default_timer()
termfreqdic = {}
for term in sampledict:
    termfreqdic[term] = sampletext.count(term)
stop = timeit.default_timer()
timetaken = stop - start
print(timetaken)
#0.42748349941757624

Time the NLTK approach:

start = timeit.default_timer()
wordFreq = nltk.FreqDist(sampledict)
stop = timeit.default_timer()
timetaken = stop - start
print(timetaken)
#0.00713308053673245

Access the data by converting the frequency distribution into a DataFrame:

wordFreqDf = pd.DataFrame(list(wordFreq.items()), columns = ["Word","Frequency"])

#Inspect data
wordFreqDf.head(10)

#output
#                     Word  Frequency
#0              60384-8/21          1
#1                 limited          2
#2                               3618
#3           comparatively          1
#4              code/month          1
#5                    four          1
#6   (microfarads):\n\nµ47          1
#7                consists          1
#8  α\n\t\t\n\t\t\n\n\n===          1
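The frequency table above counts single space-separated tokens, because sampledict was built with split(' '). One way the same counting idea might be extended to the multi-word terms in the question is to slide word n-grams over the text and keep only those found in the term set. The sketch below uses collections.Counter instead of NLTK to stay self-contained. The function name count_terms_ngrams and the \w+ tokenizer are assumptions for illustration. Dropping punctuation this way also means a term can match across a sentence boundary.

```python
import re
from collections import Counter

def count_terms_ngrams(text, terms):
    """Count multi-word terms by looking up word n-grams in a set of terms."""
    term_set = {tuple(t.lower().split()) for t in terms}
    lengths = sorted({len(t) for t in term_set})
    # Assumption: a word is a run of letters, digits or underscores.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in lengths:  # only n-gram sizes that occur in the term list
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in term_set:
                counts[" ".join(gram)] += 1
    return counts

sample = "A multilayer ceramic capacitor is a multilayer ceramic device."
terms = ["multilayer ceramic", "multilayer ceramic capacitor",
         "multilayer perceptron"]
ngram_counts = count_terms_ngrams(sample, terms)
print(ngram_counts)
```

Each set lookup is constant time on average, so the cost grows with the text length times the number of distinct term lengths, not with the number of terms. Terms that are subterms of longer terms are counted separately.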
