
I am trying to calculate the IDF of each word present in a set of documents that I scraped. I stored the information in the following format:

{
  _id: 1245236476,
  url: "https://something1.com",
  words: {
    doctor: { count: 14, idf: 0.0 },
    boss: { count: 43, idf: 0.0 },
    teacher: { count: 89, idf: 0.0 },
    .......
  }
},
{
  _id: 12346376,
  url: "https://something2.com",
  words: {
    admin: { count: 14, idf: 0.0 },
    boss: { count: 43, idf: 0.0 },
    student: { count: 89, idf: 0.0 },
    .......
  }
},
.........
{
  _id: 57856376,
  url: "https://something3.com",
  words: {
    ads: { count: 14, idf: 0.0 },
    web: { count: 43, idf: 0.0 },
    teacher: { count: 89, idf: 0.0 },
    .......
  }
}

I am trying to count, for each word, the number of documents in my collection it occurs in. The collection is more than 3.5 GB. To check whether the implementation is correct, I first ran it on a sample of 1000 documents from the collection:

from pymongo import MongoClient
from math import log

def merge(x, y):
    # Each word present in document y adds 1 to that word's document frequency.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
pipeline = [
    {
        "$project": {
            "_id": 1,
            "words": 1
        }
    }, {"$limit": 1000}
]

data = list(client['db']['collection'].aggregate(pipeline=pipeline))
document_frequency = {}
for item in data:
    document_frequency = merge(document_frequency, item['words'])

documents = len(data)
idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(documents / value)

This code produced the IDF values of all words present in those 1000 documents in about a minute. But when I try to compute the IDF values over all documents, after removing the '$limit' stage from the pipeline, I get a memory error. How can I get around this problem using the PyMongo API, or even the MongoDB aggregation framework? What would be a better way to solve this?


1 Answer


Iterating over the cursor in Python, instead of materializing the whole result set with `list(...)`, reduces memory usage: only one document is held in memory at a time.

from math import log
from pymongo import MongoClient

def merge(x, y):
    # Each word present in document y adds 1 to that word's document frequency.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
total_doc = 0
document_frequency = {}
# Iterate over the cursor directly; fetch only the 'words' field.
for doc in client['db']['collection'].find({}, {'words': 1}):
    total_doc += 1
    document_frequency = merge(document_frequency, doc['words'])

idfs = {}
for key, value in document_frequency.items():  # iteritems() is Python 2 only
    idfs[key] = log(total_doc / value)
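Since the question also asks about the aggregation framework: as a sketch (assuming MongoDB 3.4.4+ for `$objectToArray`, and the `words` sub-document schema from the question), you can compute the document frequencies server-side with `$objectToArray` + `$unwind` + `$group`, so only the small per-word counts ever reach Python. The pipeline below is illustrative; only the final IDF conversion is a plain Python function.

```python
from math import log

# Hypothetical pipeline: count, for each word key in 'words',
# how many documents it appears in (requires MongoDB 3.4.4+).
pipeline = [
    # Turn the 'words' sub-document into an array of {k, v} pairs.
    {"$project": {"words": {"$objectToArray": "$words"}}},
    # Emit one document per (document, word) pair.
    {"$unwind": "$words"},
    # Count the number of documents each word occurs in.
    {"$group": {"_id": "$words.k", "df": {"$sum": 1}}},
]

def idfs_from_counts(total_docs, df_by_word):
    """Convert per-word document frequencies into IDF values."""
    return {word: log(total_docs / df) for word, df in df_by_word.items()}

# With a live connection you would run something like:
# client = MongoClient('mongodb-uri')
# coll = client['db']['collection']
# total = coll.count_documents({})
# dfs = {d['_id']: d['df'] for d in coll.aggregate(pipeline, allowDiskUse=True)}
# idfs = idfs_from_counts(total, dfs)

# Simulated result for the three example documents in the question:
print(idfs_from_counts(3, {"teacher": 2, "boss": 2, "doctor": 1}))
```

Passing `allowDiskUse=True` lets the `$group` stage spill to disk if the distinct-word set is large.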