I am trying to calculate the IDF of each word present in a set of documents that I scraped. I stored all the information in the following format:
{
    _id: 1245236476,
    url: "https://something1.com",
    words: {
        doctor: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    },
},
{
    _id: 12346376,
    url: "https://something2.com",
    words: {
        admin: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        student: {
            count: 89,
            idf: 0.0
        },
        .......
    },
},
.........
{
    _id: 57856376,
    url: "https://something3.com",
    words: {
        ads: {
            count: 14,
            idf: 0.0
        },
        web: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    },
}
I am trying to count, for each word, the number of documents in the collection that contain it (its document frequency). The collection is more than 3.5 GB. To check whether the implementation is correct, I first tested it on a sample of 1000 documents from my collection. The code to achieve this was the following:
from math import log

from pymongo import MongoClient


def merge(x, y):
    # Each document contributes exactly 1 to a word's document
    # frequency, no matter how many times the word occurs in it.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x


client = MongoClient('mongodb-uri')
pipeline = [
    {
        "$project": {
            "_id": 1,
            "words": 1
        }
    },
    {"$limit": 1000}
]
data = list(client['db']['collection'].aggregate(pipeline=pipeline))

# Document frequency: for each word, the number of documents containing it.
document_frequency = {}
for item in data:
    document_frequency = merge(document_frequency, item['words'])

documents = len(data)
idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(documents / value)
This code produced the IDFs of all words present in those 1000 documents in about a minute. But when I try to compute the IDF values for the words in all documents, after removing the '$limit' stage from the pipeline, I get a memory error. How can I get around this problem using the PyMongo API, or even the MongoDB aggregation framework? What would be a better way to solve this issue?
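One workaround I have been considering (a sketch, not yet run against the full collection): iterate the cursor lazily instead of materializing every document with list(...), since a PyMongo cursor fetches results in batches and only one document needs to be in Python memory at a time. The URI, database, and collection names below are placeholders:

```python
from math import log


def compute_idfs(docs):
    """Stream documents once, counting for each word the number of
    documents it appears in, then compute idf = log(N / df).

    `docs` can be any iterable of {'words': {...}} documents -- in
    particular a PyMongo cursor, which fetches batches lazily so the
    whole collection never has to fit in memory at once.
    """
    document_frequency = {}
    n_docs = 0
    for doc in docs:
        n_docs += 1
        for word in doc['words']:
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return {word: log(n_docs / df)
            for word, df in document_frequency.items()}


# Usage against the real collection (placeholder names):
# from pymongo import MongoClient
# client = MongoClient('mongodb-uri')
# cursor = client['db']['collection'].aggregate(
#     [{"$project": {"words": 1}}],
#     allowDiskUse=True,  # let the server spill pipeline stages to disk
# )
# idfs = compute_idfs(cursor)
```

Would this streaming approach be enough, or is there a way to push the whole document-frequency computation into the aggregation pipeline itself?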