2
votes

I have a collection in MongoDB with a complex structure and subdocuments. The document have an structure like this:

doc1 = {
    '_id': '12345678',
    'url': "http//myurl/...",
    'nlp':{
        "status": "OK",
        "entities": {
            "0": {
                "type" : "Person",
                "relevance": "0.877245",
                "text" : "Neelie Kroes"
            },
            "1": {
                "type": "Company",
                "relevance": "0.36242",
                "text": "ICANN"
            },
            "2": {
                "type": "Company",
                "relevance": "0.265175",
                "text": "IANA" 
            }
        }
    }
}


doc2 = {
    '_id': '987456321',
    'url': "http//myurl2/...",
    'nlp':{
        "status": "OK",
        "entities": {
            "0": {
                "type": "Company",
                "relevance": "0.96",
                "text": "ICANN"
            },
            "1": {
                "type" : "Person",
                "relevance": "0.36242",
                "text" : "Neelie Kroes"
            },
            "2": {
                "type": "Company",
                "relevance": "0.265175",
                "text": "IANA" 
            }
        }
    }
}

My task is to search for "type" AND "text" inside the subdocument, then sort by "relevance". With the $elemMatch operator I'm able to perform the query:

db.resource.find({
    'nlp.entities': {
        '$elemMatch': {'text': 'Neelie Kroes', 'type': 'Person'}
    }
});

Perfect, now I have to sort all the records with entities of type "Person" and value "Neelie Kroes" by relevance descending.

I tried with a normal "sort", but, as the manual said about the sort() in $elemMatch, the result may not reflect the sort order because the sort() was applied to the elements of the array before the $elemMatch projection.

In fact, the _id:987456321 will be the first (with a relevance of 0.96, but referenced to ICANN).

How can I do, to sort my documents by matched subdocument's relevance?

P.S.: I can't change the document structure.

1
Is this coming out as a dump from some tool or is this how your document actually looks in the mongo shell? I say this because how you represent "entities" is as a "sub-document" and not an array. Those cannot be sorted by any standard means.Neil Lunn

1 Answers

2
votes

As noted I hope your documents actually do have an array, but if $elemMatch is working for you then they should.

At any rate, you cannot sort by an element in an array using find. But there is a case where you can do this using .aggregate():

db.collection.aggregate([

    // Match the documents that you want, containing the array
    { "$match": {
        "nlp.entities": {
            "$elemMatch": { 
                "text": "Neelie Kroes", 
                "type": "Person"   
            }
        }
    }},

    // Project to "store" the whole document for later, duplicating the array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "url": "$url",
            "nlp": "$nlp"          
        },
        "entities": "$nlp.entities"
    }},

    // Unwind the array to de-normalize
    { "$unwind": "$entities" },

    // Match "only" the relevant entities
    { "$match": {
        "entities.text": "Neelie Kroes", 
        "entities.type": "Person"   
    }},

    // Sort on the relevance
    { "$sort": { "entities.relevance": -1 } },

    // Restore the original document form
    { "$project": {
        "_id": "$_id._id",
        "url": "$_id.url",
        "nlp": "$_id.nlp"
    }}
])

So essentially, after doing the $match condition for documents that contained the relevant match, you then use $project "store" the original document in the _id field and $unwind a "copy" of the "entities" array.

The next $match "filters" the array contents to only those ones that are relevant. Then you apply the $sort to the "matched" documents.

As the "original" document was stored under _id, you use $project to "restore" the structure that the document actually had to begin with.

That is how you "sort" on your matched element of an array.

Note that if you had multiple "matches" within an array for a parent document, then you would have to employ an additional $group stage to get the $max value for the "relevance" field in order to complete your sort.