This question essentially boils down to "faceted search over multiple multivalued fields, sorted by weight rather than by count".
The database
I have about 10M events, each with multiple editions, each edition being described by tags. There are 5 tag types (places, speakers, participants, topics, industries).
    {
        title: "CES",
        editions: [
            {
                date: "2013-02-01",
                tags: [ {label: "Eric Schmidt", type: "speaker", "popularity": 50}, {label: "Paris", type: "place", "popularity": 30} ]
            },
            {
                date: "2012-01-23",
                tags: [ ... ]
            }
        ]
    }
Data logic
- Tags are hierarchical. For example, "Eric Schmidt" is filed under Google, which is filed under Tech companies. So, whenever Eric is at an event, all three tags are associated with the event.
- Different tags can have different popularity, meaning "Eric Schmidt" would have a popularity of 100, but "Eileen Naughton" would have a popularity of 10.
- The popularity does not apply hierarchically. That means that if "Eric Schmidt" left Google for Foursquare, his popularity would still be 100 and Foursquare would still have a popularity of 50.
- If we later find out that another participant attended an event, for example, we need to be able to add that person as a tag.
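The hierarchy rule above can be sketched in plain JavaScript. This is only an illustration: the `PARENT_OF` map and the `ancestorsOf` / `expandTag` helpers are hypothetical names, and the hierarchy is assumed to be acyclic.

```javascript
// Hypothetical hierarchy: child label -> parent label (root tags have no entry).
const PARENT_OF = new Map([
  ["Eric Schmidt", "Google"],
  ["Google", "Tech companies"],
]);

// Walk up the hierarchy (assumed acyclic) and return the ancestors, root first.
function ancestorsOf(label, parentOf) {
  const chain = [];
  for (let cur = parentOf.get(label); cur !== undefined; cur = parentOf.get(cur)) {
    chain.unshift(cur);
  }
  return chain;
}

// Labels to attach to the event, plus a comma-delimited path of the tag's ancestors.
function expandTag(label, parentOf) {
  const ancestors = ancestorsOf(label, parentOf);
  return {
    labels: [...ancestors, label],
    path: "," + ancestors.map((a) => a + ",").join(""),
  };
}

console.log(expandTag("Eric Schmidt", PARENT_OF));
```

Tagging an event with "Eric Schmidt" then yields all three labels, and the path string records where the tag sits in the tree. A root tag gets the path "," under this convention.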
Search requirements
Now, imagine a left-hand menu with 4 sections:
Places
------------
Paris
London
New York
[more]
Speakers
----------
Google
Facebook
Mark Zuckerberg
[more]
and so on.
Whenever the user clicks on a tag, I want the menu to reflect the results and allow them to drill down further (faceted search). The twist is that when deciding whether to show "Google", "Eric Schmidt" or "Foursquare" among the first three tags in each section, I want the most popular tag to be shown higher, based on [number of matching events] * [tag popularity]. That means that if there are 3 matching events for "Foursquare" and only one for "Eric Schmidt", it should show Foursquare first, with a score of 3 * 50 = 150 vs. Schmidt's 1 * 100 = 100.
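As a worked example of that scoring rule, here is a minimal sketch in plain JavaScript. The `rankTags` helper and its argument shapes are assumptions made for illustration; the counts and popularity values are the ones from the example above.

```javascript
// Rank facet tags by (number of matching events) * (tag popularity)
// and keep the top `limit` entries for the menu section.
function rankTags(matchCounts, popularity, limit = 3) {
  return Object.entries(matchCounts)
    .map(([label, count]) => ({ label, score: count * popularity[label] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const ranked = rankTags(
  { "Foursquare": 3, "Eric Schmidt": 1 },   // matching events per tag
  { "Foursquare": 50, "Eric Schmidt": 100 } // popularity per tag
);
console.log(ranked);
// Foursquare (3 * 50 = 150) is listed before Eric Schmidt (1 * 100 = 100)
```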
Also, ideally, if I select "Google" then, for the "speakers" section, the system should not return speakers outside Google, even if the matching events also have "Zuckerberg" listed, with a huge popularity of 200. So, the returned tags should reside "beneath" the current selection in each section, and their sorting should be based on the above scoring logic.
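The "beneath the current selection" rule could be checked against a comma-delimited ancestor path stored on each tag. The sketch below assumes such a `path` field exists; the `isBeneath` helper is a hypothetical name and the Facebook path is made up for the example.

```javascript
// A tag sits beneath `selectedLabel` if that label appears anywhere in its
// ancestor path. The surrounding commas prevent partial-label matches.
function isBeneath(tag, selectedLabel) {
  return tag.path.includes("," + selectedLabel + ",");
}

const speakers = [
  { label: "Eric Schmidt", popularity: 100, path: ",Tech Companies,Google," },
  // Path assumed for illustration only.
  { label: "Mark Zuckerberg", popularity: 200, path: ",Tech Companies,Facebook," },
];

const beneathGoogle = speakers.filter((tag) => isBeneath(tag, "Google"));
console.log(beneathGoogle.map((tag) => tag.label));
```

With "Google" selected, only "Eric Schmidt" survives the filter, regardless of Zuckerberg's higher popularity.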
Current MongoDB solution
Store a separate document for each edition:
    {
        event: "CES",
        date: "2013-02-01",
        tags: [ {label: "Eric Schmidt", type: "speaker", "popularity": 50, path: ",Tech Companies,Google,"}, {label: "Paris", type: "place", "popularity": 30, path: ",Europe,France,"} ]
    },
    {
        event: "CES",
        date: "2012-01-23",
        tags: [ ... ]
    }
Use the aggregation framework
One query for each tag type (5 queries per request):
    db.events.aggregate([
        // keep only the editions that carry every selected tag
        {
            '$match': {'tags.label': {'$all': ["selected tag 1", "selected tag 2", ...]}}
        },
        {
            '$unwind': '$tags'
        },
        // group by events, so we can later sum each tag's popularity only once per event, not per event edition
        {
            '$group': {
                '_id': '$event',
                'taglist': {
                    '$addToSet': {
                        'label': '$tags.label',
                        'type': '$tags.type',
                        'popularity': '$tags.popularity',
                        // carry the path along, otherwise the path match below has nothing to test
                        'path': '$tags.path'
                    }
                }
            }
        },
        {
            '$unwind': '$taglist'
        },
        {
            '$match': {
                'taglist.type': "speaker",
                /* haven't tested this path-matching, but it should work
                   to only get the tags that are in the bottom tree
                   of the current selected speaker tag; not anchored with ^
                   because the selected tag may not be the root of the path */
                'taglist.path': /,selected speaker tag,/
            }
        },
        // one row per tag: popularity summed once per matching event
        {
            '$group': {
                '_id': '$taglist.label',
                'score': {
                    '$sum': '$taglist.popularity'
                }
            }
        },
        // highest score first
        {
            '$sort': {'score': -1}
        }
    ]);
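To sanity-check what the pipeline is meant to compute without a running server, the same steps can be mirrored in plain JavaScript on a handful of edition documents. This is a rough in-memory sketch, not a replacement for the query: `facetScores` is a hypothetical helper, the event names other than "CES" are made up, and tags are de-duplicated per event by label (assuming a label identifies a tag).

```javascript
function facetScores(editions, selectedLabels, type, selectedInType) {
  // First $match: keep editions that carry every selected tag.
  const matching = editions.filter((e) =>
    selectedLabels.every((l) => e.tags.some((t) => t.label === l)));

  // $unwind + $group by event: keep one copy of each tag per event.
  const tagsByEvent = new Map();
  for (const edition of matching) {
    if (!tagsByEvent.has(edition.event)) tagsByEvent.set(edition.event, new Map());
    for (const tag of edition.tags) tagsByEvent.get(edition.event).set(tag.label, tag);
  }

  // Second $match + $group: filter by type and path, sum popularity per label.
  const scores = new Map();
  for (const unique of tagsByEvent.values()) {
    for (const tag of unique.values()) {
      if (tag.type !== type) continue;
      if (selectedInType && !tag.path.includes("," + selectedInType + ",")) continue;
      scores.set(tag.label, (scores.get(tag.label) || 0) + tag.popularity);
    }
  }

  // $sort: highest score first.
  return [...scores.entries()]
    .map(([label, score]) => ({ label, score }))
    .sort((a, b) => b.score - a.score);
}

const schmidt = { label: "Eric Schmidt", type: "speaker", popularity: 100, path: ",Tech Companies,Google," };
const foursquare = { label: "Foursquare", type: "speaker", popularity: 50, path: ",Tech Companies," };
const sampleEditions = [
  { event: "CES", date: "2013-02-01", tags: [schmidt] },
  { event: "CES", date: "2012-01-23", tags: [schmidt] }, // same event, counted once
  { event: "Event B", date: "2013-03-01", tags: [foursquare] },
  { event: "Event C", date: "2013-04-01", tags: [foursquare] },
  { event: "Event D", date: "2013-05-01", tags: [foursquare] },
];

const speakerScores = facetScores(sampleEditions, [], "speaker", null);
console.log(speakerScores);
```

On this sample, Foursquare appears in three events and scores 150, while Eric Schmidt scores 100 because the two CES editions count as a single event.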
Ok, the pipeline should work, algorithmically, but performance-wise it will surely not work for 50M event editions, each with thousands of possible tags.
Can anyone think of another approach? Can this approach be optimized in any way other than using Map/Reduce, which I understand is far too slow to run on the fly for each user?