Database structure and search approach for complex search engine based on time and tags

Question

I've been trying to figure this out for two months now, went through countless brainstorming sessions with other developers and still haven't been able to come up with a good solution.

The idea We're building a search engine for conferences, public events, etc.

The data
I have a data set of tens of thousands of events (both future and historical) with the following structure:

{
    id: 10, 
    name: "CES",
    intervals: [
        {
            interval_start: "2013-01-01 08:00", 
            interval_end: "2013-01-15 10:00", 
            tags_by_type: {
                people: [{name: "Eric Schmidt", weight: 20}, ...]
                companies: [{name: "Google", weight: 100}, {name: "Microsoft", weight: 100}, ...],
                topics: [{name: "Social Networking", weight: 80}, {name: "Internet marketing", weight: 95}, ...],
                places: [{name: "Cannes Palace Hotel", weight: 100}, {name: "Cannes", weight: 100}, {name: "France", weight: 100}]
            },
            tags: ["Eric Schmidt", "Google", "Microsoft", "Social Networking", "Internet Marketing", "Cannes Palace Hotel", "Cannes", "France"]
        },
        {
            interval_start: "2011-01-01 10:00", 
            interval_end: "2011-01-15 12:00", 
            tags_by_type: {
                people: [{name: "Marissa Meyer", weight: 20}, ...]
                companies: [{name: "Yahoo", weight: 100}, {name: "Facebook", weight: 100}, ...],
                topics: [{name: "Recruiting", weight: 80}, {name: "Internet marketing", weight: 15}, ...],
                places: [{name: "New york", weight: 100}, {name: "USA", weight: 100}]
            },
            tags: ["Marissa Mayer", "Yahoo", "Facebook", "Recruiting", "Internet marketing", "New york", "USA"]
        },
        ...
    ],

}

We use a normalized MySQL database to add/update/delete events and tags, and then we compile data in various formats (such as the document above) for various search scenarios.

There are hierarchies between tags (Marketing is a parent of Internet Marketing so whenever Internet Marketing is a tag, Marketing will also be a tag)
The weight number represents how important/relevant the respective tag is for the respective time frame

The problem We want to provide the users with a menu they can use to click through and filter events, such as:

Location: [suggested places] USA, France, ... [click to browse all locations]
People: [suggested people] Eric Schmidt, Marissa Meyer, ... [click to browse all people]
Topics: [suggested topics] Internet Marketing, Startups, ... [click to browse all topics]

Clicking any tag in the menu **has to** result in at least one result (no dead-end tags in the menu).
Whenever the user clicks on any of the tags in the menu, the search is performed the menu should repopulate with tags from the subset of events resulting from the search, so the user can continue clicking
Only the top 5 tags, based on their weight, will show before the [click to browse all...] link.
Clicking [click to browse all...] link pops up a hierarhical menu. For locations it would be a list of continents. clicking a continent pulls up a list of countries. clicking a country pulls up a list of cities. No weighting here, just hierarhical browsing

Current approach

Given the above document structure we came up with, search for events using MongoDb if quite simple:

{"intervals.tags": { $in: [selectedtag1, selectedtag2, selectedtag3]}}

However, figuring out which tags to further show the user in the tag menu is proving to be a pain :) Assuming we ignore the weights and just try to figure out the most common tags, we tried this:

db.events.aggregate( { $unwind: "$intervals" }, {$unwind: "$intervals.tags"}, {$group: {"_id": "$intervals.tags", "evCount": {$sum:1}}}, {$match: {"evCount": {$lt: TOTAL_COUNT_OF_EVENTS_MATCHING_OUR_SEARCH}}} );

First problem with that query is that the last condition should ignore tags that relate to ALL events matched (because there would be no point to show a tag that doesn't filter the results when clicked). The query above currently filters out tags that relate to all INTERVALS (in stead of EVENTS).
Second problem with that query is that for large datasets, it will probably run out of memory

We also tried Just for the menu issue, we tried to start with tags, rather than events:

"Eric Schmidt" relates to "Google", "Microsoft", "Social Networking", "Internet Marketing", "Cannes Palace Hotel" ... in the interval "2013-01-01 08:00" and "2013-01-01 10:00"
"Google" relates to "Eric Schmidt", "Microsoft" ...  in the interval "2013-01-01 08:00" and "2013-01-01 10:00"
...

We then mapped these relationships in a MySQL table:

| tag          | related tag | event | start time       | end time         |
----------------------------------------------------------------------------
| Eric Schmidt | Google      | CES   | 2013-01-01 08:00 | 2013-01-01 10:00 |
| Eric Schmidt | Microsoft   | CES   | 2013-01-01 08:00 | 2013-01-01 10:00 |
...

and, assuming the user has selected SELECTED_TAG_1 and SELECTED_TAG_2 from the menu, tried to query it using a SELF JOIN, making sure the intervals match:

SELECT a.related_tag FROM tag_relations a JOIN tag_relations b 
ON a.related_tag = b.related_tag 
AND a.tag = SELECTED_TAG_1 AND b.tag = SELECTED_TAG_2 
AND ( (a.start_time < b.start_time AND a.end_time > b.start_time) OR (a.start_time > b.start_time AND a.start_time < b.end_time) )

but there are two problems:

For each extra tag added to the selection, the interval matching adds complexity (for three tags, we'd match intervals from a with b, b with c and a with c)
it does not return the number of events for each tag so that we can exclude those who match all the resulting events

Do you guys have any thoughts on how to improve any of these two approaches or maybe suggest a new one?

I know this can't be a quick reply and I thank you a million times for taking the time to read and understand this problem.

It seems like there are a lot of complicated relationships represented in your model. Tags, parent tags, dates, events, etc. It also seems like there are dynamic structures being created. You want the most popular tags and they belong to some structure that can be discovered through clicking? I haven't had a chance to try this yet, but I always imagine that once relationships get complex, a graph database might be a good option. It helps you manage explicitly complex relationships (edges) between objects (nodes). I don't know from experience, but I imagine they have good traversal as well. — ryan1234
Hey Ryan, I've actually never even heard of graph databases until now, but I'm definitely going to read up on them. Thanks for the input! — mitai
Check out Titan. thinkaurelius.github.com/titan I heard about it from a video interview with its creator. He also does a good job describing how you might use a graph database: channel9.msdn.com/Shows/Going+Deep/… — ryan1234
thanks again Ryan, I'll check it out right after making sure I understand Neo4j, which seems to have been around for some time now. — mitai

acutesoftware acutesoftware · Accepted Answer · 2013-03-15T05:27:29

One of the problems you have is that it is unlikely that you will ALWAYS have enough data to have at least one result for all combinations of user selections.

And if that is the case, instead of making it complex for yourself, why not do what the other sites do and just show 'No results for blah' and then offer suggestions. For example you could show them partial results of the users selection with one of the filters removed, or you could simply offer them a link to remove (or roll up) their list of current filters.

Database structure and search approach for complex search engine based on time and tags

1 Answers