python symmetric word matrix using nltk

Question

I'm trying to create a symmetric word matrix from a text document.

For example: text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

I have tokenized the text document using nltk. Now I want to count how many times other words appear in the same sentence. From the text above, I want to create the matrix below:

        Barbara good    friends Benny   bad
Barbara 2   1   1   1   0
good    1   1   0   0   0
friends 1   0   1   1   0
Benny   1   0   1   2   1
bad     0   0   1   1   1

Note the diagonals are the frequency of the word. Since Barbara appears with Barbara in a sentence as often as there are Barbaras. I hope to not overcount, but this is not a big issue if the code becomes too complicated.

qwwqwwq qwwqwwq · Accepted Answer · 2013-07-03T23:21:06

First we tokenize the text, iterate through each sentence, and iterate through all pairwise combinations of the words in each sentence, and store out counts in a nested dict:

from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict
import numpy as np
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

for sent in sent_tokenize(text):
    words = word_tokenize(sent)
    for word1 in words:
        for word2 in words:
            sparse_matrix[word1][word2]+=1

print sparse_matrix
>> defaultdict(<function <lambda> at 0x7f46bc3587d0>, {
'good': defaultdict(<function <lambda> at 0x3504320>, 
    {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}), 
'friends': defaultdict(<function <lambda> at 0x3504410>, 
    {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}), etc..

This is essentially like a matrix, in that we can index sparse_matrix['good']['Barbara'] and get the number 1, and index sparse_matrix['bad']['Barbara'] and get 0, but we actually aren't storing counts for any words that never co-occured, the 0 is just generated by the defaultdict only when you ask for it. This can really save a lot of memory when doing this stuff. If we need a dense matrix for some type of linear algebra or other computational reason, we can get it like this:

lexicon_size=len(sparse_matrix)
def mod_hash(x, m):
    return hash(x) % m
dense_matrix = np.zeros((lexicon_size, lexicon_size))

for k in sparse_matrix.iterkeys():
    for k2 in sparse_matrix[k].iterkeys():
        dense_matrix[mod_hash(k, lexicon_size)][mod_hash(k2, lexicon_size)] = \
            sparse_matrix[k][k2]

print dense_matrix
>>
[[ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  1.  0.  0.  1.]
 [ 0.  0.  1.  1.  1.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  2.  0.  2.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  1.  2.  0.  3.]]

I would recommend looking at http://docs.scipy.org/doc/scipy/reference/sparse.html for other ways of dealing with matrix sparsity.

python symmetric word matrix using nltk

2 Answers