5
votes

Can anyone suggest how to handle the "document size exceeds 16MB" error when inserting a document into a MongoDB collection? I found some solutions such as GridFS, which can handle this problem, but I need a solution without using GridFS. Is there any way to make the document smaller or split it into subdocuments? If yes, how can this be achieved?

from pymongo import MongoClient

conn = MongoClient("mongodb://sample_mongo:27017")
db_conn = conn["test"]
db_collection = db_conn["sample"]

# the size of this record is 23MB
record = {
    "name": "drugs",
    "collection_id": 23,
    "timestamp": 1515065002,
    "tokens": [],          # contains a list of strings
    "tokens_missing": [],  # contains a list of strings
    "token_mapping": {}    # dictionary containing transformed tokens
}

db_collection.insert(record, check_keys=False)

I got the error DocumentTooLarge: BSON document too large. In MongoDB, the maximum BSON document size is 16 megabytes.

  File "/usr/local/lib/python2.7/dist-packages/pymongo-3.5.1-py2.7-linux-x86_64.egg/pymongo/collection.py", line 2501, in insert
check_keys, manipulate, write_concern)
  File "/usr/local/lib/python2.7/dist-packages/pymongo-3.5.1-py2.7-linux-x86_64.egg/pymongo/collection.py", line 575, in _insert
check_keys, manipulate, write_concern, op_id, bypass_doc_val)
  File "/usr/local/lib/python2.7/dist-packages/pymongo-3.5.1-py2.7-linux-x86_64.egg/pymongo/collection.py", line 556, in _insert_one
check_keys=check_keys)
  File "/usr/local/lib/python2.7/dist-packages/pymongo-3.5.1-py2.7-linux-x86_64.egg/pymongo/pool.py", line 482, in command
self._raise_connection_failure(error)
  File "/usr/local/lib/python2.7/dist-packages/pymongo-3.5.1-py2.7-linux-x86_64.egg/pymongo/pool.py", line 610, in _raise_connection_failure
raise error
  DocumentTooLarge: BSON document too large (22451007 bytes) - the connected server supports BSON document sizes up to 16793598 bytes.
2
Welcome to Stack Overflow. Please be a bit more specific when asking a question: What have you tried so far, with a code example? (I downvoted because there is no code.) What do you expect? What error do you get? For help, take a look at "How to ask". – Hille
@Hille I updated the question with the code I tried and specified the error. Thanks. – Thrisundar Reddy J
Find out which document field makes it so big (tokens, tokens_missing?) and store it in a separate collection as a document that holds a reference to the original document. – Andriy Simonov

2 Answers

2
votes

The maximum BSON document size is 16 megabytes. To store documents larger than the maximum size, MongoDB provides the GridFS API.

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB. GridFS stores a large document by dividing it into parts, or chunks, and each chunk is stored as a separate document. The default size of a GridFS chunk is 255 kB. GridFS uses two collections to store files: one collection stores the file chunks, and the other stores file metadata.
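As a minimal sketch of what this looks like with pymongo's gridfs module (the helper names are my own, and the connection details are taken from the question):

```python
import json


def store_large_record(fs, record, filename):
    """Serialize ``record`` to JSON bytes and hand it to GridFS,
    which transparently splits the data into 255 kB chunks."""
    return fs.put(json.dumps(record).encode("utf-8"), filename=filename)


def load_large_record(fs, file_id):
    """Reassemble the chunks and deserialize the original dict."""
    return json.loads(fs.get(file_id).read().decode("utf-8"))


# Usage against a live server (assumes the question's connection string):
#   import gridfs
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://sample_mongo:27017")["test"]
#   fs = gridfs.GridFS(db)
#   file_id = store_large_record(fs, record, "drugs_23")
#   restored = load_large_record(fs, file_id)
```

Note that GridFS stores opaque bytes, not queryable documents, so fields inside the stored record can no longer be filtered or indexed server-side.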

1
votes

The quick answer is no, you cannot get around the 16 MB BSON size limit. If you hit this limit, you will need to explore alternatives such as GridFS or a different schema design for your documents.

I would start by asking a series of questions to determine the focus of your design, such as:

  1. You have fields called tokens, tokens_missing, and token_mapping. I imagine these fields are very large individually, and putting all three into one document pushes it over 16 MB. Is it possible to split them into three collections instead?

  2. What is your application's access pattern? Which fields do you need to access all the time, and which do you access less often? You can split the document into different collections based on those patterns.

  3. Bear in mind the need to index the documents, since MongoDB's performance is closely tied to good indexes that support your queries. Note that you cannot index two arrays in a single compound index; there is more information in Multikey Indexes.

  4. If you need to combine all the related data in a query, MongoDB 3.2 and newer provides the $lookup aggregation operator, which is similar to SQL's left outer join.

Unlike SQL's normal-form schema design, MongoDB schema design is based on your application's access pattern. The 16 MB limit is there to tell you that the design is probably not optimal: such large documents are detrimental to performance, difficult to update, etc. Typically, it is better to have many small documents than a few gigantic ones.

More examples can be found in Data Model Design and Data Model Examples and Patterns.