I've been trying to figure this out for two months now, went through countless brainstorming sessions with other developers and still haven't been able to come up with a good solution.
The idea We're building a search engine for conferences, public events, etc.
The data
I have a data set of tens of thousands of events (both future and historical) with the following structure:
{
id: 10,
name: "CES",
intervals: [
{
interval_start: "2013-01-01 08:00",
interval_end: "2013-01-15 10:00",
tags_by_type: {
people: [{name: "Eric Schmidt", weight: 20}, ...]
companies: [{name: "Google", weight: 100}, {name: "Microsoft", weight: 100}, ...],
topics: [{name: "Social Networking", weight: 80}, {name: "Internet marketing", weight: 95}, ...],
places: [{name: "Cannes Palace Hotel", weight: 100}, {name: "Cannes", weight: 100}, {name: "France", weight: 100}]
},
tags: ["Eric Schmidt", "Google", "Microsoft", "Social Networking", "Internet Marketing", "Cannes Palace Hotel", "Cannes", "France"]
},
{
interval_start: "2011-01-01 10:00",
interval_end: "2011-01-15 12:00",
tags_by_type: {
people: [{name: "Marissa Meyer", weight: 20}, ...]
companies: [{name: "Yahoo", weight: 100}, {name: "Facebook", weight: 100}, ...],
topics: [{name: "Recruiting", weight: 80}, {name: "Internet marketing", weight: 15}, ...],
places: [{name: "New york", weight: 100}, {name: "USA", weight: 100}]
},
tags: ["Marissa Mayer", "Yahoo", "Facebook", "Recruiting", "Internet marketing", "New york", "USA"]
},
...
],
}
We use a normalized MySQL database to add/update/delete events and tags, and then we compile data in various formats (such as the document above) for various search scenarios.
- There are hierarchies between tags (Marketing is a parent of Internet Marketing so whenever Internet Marketing is a tag, Marketing will also be a tag)
- The weight number represents how important/relevant the respective tag is for the respective time frame
The problem We want to provide the users with a menu they can use to click through and filter events, such as:
Location: [suggested places] USA, France, ... [click to browse all locations]
People: [suggested people] Eric Schmidt, Marissa Meyer, ... [click to browse all people]
Topics: [suggested topics] Internet Marketing, Startups, ... [click to browse all topics]
- Clicking any tag in the menu **has to** result in at least one result (no dead-end tags in the menu).
- Whenever the user clicks on any of the tags in the menu, the search is performed the menu should repopulate with tags from the subset of events resulting from the search, so the user can continue clicking
- Only the top 5 tags, based on their weight, will show before the [click to browse all...] link.
- Clicking [click to browse all...] link pops up a hierarhical menu. For locations it would be a list of continents. clicking a continent pulls up a list of countries. clicking a country pulls up a list of cities. No weighting here, just hierarhical browsing
Current approach
Given the above document structure we came up with, search for events using MongoDb if quite simple:
{"intervals.tags": { $in: [selectedtag1, selectedtag2, selectedtag3]}}
However, figuring out which tags to further show the user in the tag menu is proving to be a pain :) Assuming we ignore the weights and just try to figure out the most common tags, we tried this:
db.events.aggregate( { $unwind: "$intervals" }, {$unwind: "$intervals.tags"}, {$group: {"_id": "$intervals.tags", "evCount": {$sum:1}}}, {$match: {"evCount": {$lt: TOTAL_COUNT_OF_EVENTS_MATCHING_OUR_SEARCH}}} );
- First problem with that query is that the last condition should ignore tags that relate to ALL events matched (because there would be no point to show a tag that doesn't filter the results when clicked). The query above currently filters out tags that relate to all INTERVALS (in stead of EVENTS).
- Second problem with that query is that for large datasets, it will probably run out of memory
We also tried Just for the menu issue, we tried to start with tags, rather than events:
"Eric Schmidt" relates to "Google", "Microsoft", "Social Networking", "Internet Marketing", "Cannes Palace Hotel" ... in the interval "2013-01-01 08:00" and "2013-01-01 10:00"
"Google" relates to "Eric Schmidt", "Microsoft" ... in the interval "2013-01-01 08:00" and "2013-01-01 10:00"
...
We then mapped these relationships in a MySQL table:
| tag | related tag | event | start time | end time |
----------------------------------------------------------------------------
| Eric Schmidt | Google | CES | 2013-01-01 08:00 | 2013-01-01 10:00 |
| Eric Schmidt | Microsoft | CES | 2013-01-01 08:00 | 2013-01-01 10:00 |
...
and, assuming the user has selected SELECTED_TAG_1 and SELECTED_TAG_2 from the menu, tried to query it using a SELF JOIN, making sure the intervals match:
SELECT a.related_tag FROM tag_relations a JOIN tag_relations b
ON a.related_tag = b.related_tag
AND a.tag = SELECTED_TAG_1 AND b.tag = SELECTED_TAG_2
AND ( (a.start_time < b.start_time AND a.end_time > b.start_time) OR (a.start_time > b.start_time AND a.start_time < b.end_time) )
but there are two problems:
- For each extra tag added to the selection, the interval matching adds complexity (for three tags, we'd match intervals from a with b, b with c and a with c)
- it does not return the number of events for each tag so that we can exclude those who match all the resulting events
Do you guys have any thoughts on how to improve any of these two approaches or maybe suggest a new one?
I know this can't be a quick reply and I thank you a million times for taking the time to read and understand this problem.