
I am trying to insert a large number of documents (1M+) using a bulk_write instruction. To do that, I create a list of InsertOne operations.

python version = 3.7.4

pymongo version = 3.8.0

Document creation:

document = {
    'dictionary': ObjectId(dictionary_id),
    'price': price,
    'source': source,
    'promo': promo,
    'date': now_utc,
    'updatedAt': now_utc,
    'createdAt': now_utc
}

# line added to debug
if '_id' in document:
    print(document)

return document

I create the full list of documents by adding a new field for each element of a list, and I build the operations using InsertOne:

bulk = []
for element in list_elements:
    for document in documents:
        document['new_field'] = element
        # line added to debug
        if '_id' in document:
            print(document)
        insert = InsertOne(document)
        bulk.append(insert)
return bulk

I do the insert using the bulk_write command:

collection.bulk_write(bulk, ordered=False)

Here is the documentation: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.bulk_write

According to the documentation, the _id field is added automatically: "document: The document to insert. If the document is missing an _id field one will be added."

And somehow it seems to be doing this wrong, because some of the documents end up with the same value. I am receiving this error (with different _id values, of course) for 700k of the 1M documents:

E11000 duplicate key error collection: database.collection index: _id_ dup key: { _id: ObjectId('5f5fccb4b6f2a4ede9f6df62') }

This seems like a pymongo bug to me, because I have used this approach in many situations, but never with this many documents.
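
For reference, the individual failures can be inspected by catching BulkWriteError (a sketch using the names from the snippets above):

from pymongo.errors import BulkWriteError

try:
    collection.bulk_write(bulk, ordered=False)
except BulkWriteError as bwe:
    # bwe.details['writeErrors'] holds one entry per failed operation,
    # including the E11000 message and the offending document.
    for error in bwe.details['writeErrors'][:5]:
        print(error['errmsg'])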

The _id field certainly has to be unique, but since it is generated automatically by pymongo, I don't know how to approach this problem. Perhaps I could use UpdateOne with upsert=True and an impossible filter, and hope for the best.
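
For illustration, that workaround might look something like this (a sketch of a single operation; a freshly generated ObjectId can never match an existing document, so the filter is effectively impossible and the upsert always inserts):

from bson import ObjectId
from pymongo import UpdateOne

op = UpdateOne({'_id': ObjectId()},          # impossible filter
               {'$setOnInsert': document},   # fields applied only on insert
               upsert=True)
bulk.append(op)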

I would appreciate any solution or workaround for this problem.

2 Answers

0
votes

If any of the documents from your code snippet already contain an _id, a new one won't be added, and you run the risk of getting a duplicate key error, as you have observed.
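
A minimal sketch that reproduces this, assuming a local MongoDB server and a throwaway collection (hypothetical names):

from pymongo import MongoClient, InsertOne
from pymongo.errors import BulkWriteError

client = MongoClient()
collection = client.test.dup_demo

doc = {'price': 10}
# The same dict backs both operations; executing the first insert
# adds an '_id' to doc in place, so the second insert reuses it.
try:
    collection.bulk_write([InsertOne(doc), InsertOne(doc)], ordered=False)
except BulkWriteError as bwe:
    print(bwe.details['writeErrors'][0]['errmsg'])  # E11000 duplicate key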

0
votes

It seems that, as I was adding the new field to the document and appending it to the list, I was appending references to the same dict instances, so I ended up with the same operations len(list_elements) times, and that is why I got the duplicate key error.

To solve the problem, I append a copy of the document to the list:

bulk.append(document.copy())

and then create the InsertOne operations from that list.
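
A sketch of what the corrected loop from the question looks like with that change:

bulk = []
for element in list_elements:
    for document in documents:
        new_doc = document.copy()        # independent dict per operation
        new_doc['new_field'] = element
        bulk.append(InsertOne(new_doc))
return bulk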

I would like to thank @Belly Buster for his help with this issue.