I'm looking for advice on which indexing strategy to use in MongoDB 3.4.

Let's suppose we have a people collection of documents with the following shape:

{
    _id: 10,
    name: "Bob",
    age: 32,
    profession: "Hacker"
}

Let's imagine that a web API is exposed to query the collection and that the only possible filters are by name and by age.
A sample call to the api will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25

Such a call will be translated into the following query: db.people.find({name: "Bob", age: 25}).
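To make the translation step concrete, here is a minimal Node.js sketch of how the query string could be turned into the filter document. The helper name buildPeopleFilter is hypothetical and not part of the application described here; it assumes both parameters are always present, as stated in the scenario. Note that age is converted to a number, because a string "25" would not match a numeric age field.

```javascript
// Hypothetical helper: turns the API's query string into a MongoDB filter document.
// Assumes both "name" and "age" are always supplied, as in the scenario above.
function buildPeopleFilter(requestUrl) {
  var params = new URL(requestUrl).searchParams;
  // The sample URL wraps the name in double quotes, so strip them if present.
  var name = (params.get("name") || "").replace(/^"|"$/g, "");
  // Query string values are strings; the age field is stored as a number.
  var age = Number(params.get("age"));
  return { name: name, age: age };
}

var filter = buildPeopleFilter('http://myAwesomeWebSite/people?name="Bob"&age=25');
console.log(filter); // the document that would be passed to db.people.find(...)
```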

To better clarify our scenario, consider that:

  • the field name was already in our documents and we already have an index on that field
  • we are going to add the new field age due to some new features of our application
  • the database is only accessible via the web api mentioned above and the most important requirement is to expose a super fast web api
  • all the calls to the web api will apply a filter on both the fields name and age (put another way, all the calls to the web api will have the same pattern, which is the one shown above)

That said, we have to decide which of the following indexing options offers the best performance:

  • One compound index: {name: 1, age: 1}
  • Two single-field indexes: {name: 1} and {age: 1}
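For reference, the two options could be set up along these lines. This is only a sketch: the function names and the strategy labels "compound" and "single-field" are made up for illustration, and collection stands for any object exposing createIndex(keys), such as db.people in the mongo shell.

```javascript
// Returns the index key documents for each option being compared.
// The strategy labels are hypothetical names used only in this sketch.
function indexSpecsFor(strategy) {
  if (strategy === "compound") return [{ name: 1, age: 1 }];
  if (strategy === "single-field") return [{ name: 1 }, { age: 1 }];
  throw new Error("unknown strategy: " + strategy);
}

// Creates every index of the chosen option on the given collection.
function applyIndexes(collection, strategy) {
  return indexSpecsFor(strategy).map(function (keys) {
    return collection.createIndex(keys);
  });
}

// In the mongo shell this would be called as, for example:
//   applyIndexes(db.people, "compound")
```

Since an index on {name: 1} already exists, the single-field option only adds {age: 1}; creating an index that already exists with the same keys is a no-op.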

According to some simple tests, it seems that the single compound index is much more performant than the two single-field indexes.

When executing a single query via the mongo shell, the explain() method suggests that with a single compound index you can query the database nearly ten times faster than with two single-field indexes.
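When comparing the two setups, the raw execution time is less telling than the plan stages and the number of keys and documents examined. The helper below is a hypothetical sketch that pulls those numbers out of the document returned by explain("executionStats"); the sample object is hand-written for illustration and is not a real measurement.

```javascript
// Collects the stage names of a plan tree (e.g. FETCH -> IXSCAN, or AND_SORTED
// with several IXSCAN children when index intersection is used).
function planStages(stage) {
  if (!stage) return [];
  var children = stage.inputStages || (stage.inputStage ? [stage.inputStage] : []);
  var names = [stage.stage];
  children.forEach(function (child) {
    names = names.concat(planStages(child));
  });
  return names;
}

// Hypothetical helper: reduces an explain("executionStats") document to the
// figures that matter when comparing indexing strategies.
function summarizeExplain(explainOutput) {
  var stats = explainOutput.executionStats;
  return {
    stages: planStages(explainOutput.queryPlanner.winningPlan),
    nReturned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    millis: stats.executionTimeMillis
  };
}

// Hand-written sample with the shape of a compound-index plan (illustrative values).
var sample = {
  queryPlanner: { winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN" } } },
  executionStats: { nReturned: 1, executionTimeMillis: 0, totalKeysExamined: 1, totalDocsExamined: 1 }
};
console.log(summarizeExplain(sample));

// In the mongo shell:
//   summarizeExplain(db.people.find({name: "Bob", age: 25}).explain("executionStats"))
```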

This difference seems to be less dramatic in a more realistic scenario, where instead of executing a single query via the mongo shell, multiple calls are made to two different URLs of a Node.js web application. Both URLs execute a query against the database and return the fetched data as a JSON array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections having exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.

Based on these test results, we are considering using the single compound index approach.

Does anyone have experience with this topic? Are we missing any important consideration (maybe some disadvantage of big compound indexes)?

1 Answer

Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (as in name and age in your example), in order to find the resulting documents, MongoDB will either:

  1. do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): This is super I/O intensive and hence slow.
  2. use one index that holds one of the fields (use the index tree to locate the relevant subset of documents, followed by a scan of them): Depending on your data distribution/index selectivity this can be very fast or barely provide any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years old: every lookup would still yield a huge number of documents).
  3. use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): Again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower than #4 (as you mentioned).
  4. use a compound index (two subsequent key lookups immediately lead to the required documents): This will be the fastest option of all, given that it requires the fewest and cheapest operations to get to the right documents. In order to ensure the greatest level of reuse (not performance, which won't be affected by this), you should in general start with the most selective field, so in your case probably name and not age, given that a lot of people will have the same age (low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how to best define a compound index taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
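To get a feeling for why the four options differ, here is a toy simulation in plain JavaScript. It is not MongoDB code and does not model how MongoDB actually stores or intersects indexes (real indexes are B-trees, not hash maps); all names and the data set are invented. It only counts how many index keys and documents each strategy would have to touch for the same query.

```javascript
// Toy data: 1000 people, 10 distinct names, ages 30..39, 10 people per (name, age) pair.
var people = [];
for (var i = 0; i < 1000; i++) {
  people.push({ _id: i, name: "name" + (i % 10), age: 30 + (Math.floor(i / 10) % 10) });
}

// A toy "index": maps a key to the list of _ids holding that key.
function buildIndex(docs, keyOf) {
  var index = new Map();
  docs.forEach(function (doc) {
    var key = keyOf(doc);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  });
  return index;
}
var byName = buildIndex(people, function (d) { return d.name; });
var byAge = buildIndex(people, function (d) { return d.age; });
var byNameAge = buildIndex(people, function (d) { return d.name + "|" + d.age; });

function matchesQuery(doc, q) { return doc.name === q.name && doc.age === q.age; }

// 1. Full collection scan: every document is examined.
function collectionScan(q) {
  var hits = people.filter(function (d) { return matchesQuery(d, q); });
  return { matches: hits.length, keysExamined: 0, docsExamined: people.length };
}

// 2. One single-field index (name), then fetch and filter the candidates on age.
function singleIndex(q) {
  var ids = byName.get(q.name) || [];
  var hits = ids.filter(function (id) { return matchesQuery(people[id], q); });
  return { matches: hits.length, keysExamined: ids.length, docsExamined: ids.length };
}

// 3. Two single-field indexes: read both key ranges, intersect, fetch the survivors.
function indexIntersection(q) {
  var nameIds = byName.get(q.name) || [];
  var ageIds = new Set(byAge.get(q.age) || []);
  var ids = nameIds.filter(function (id) { return ageIds.has(id); });
  return { matches: ids.length, keysExamined: nameIds.length + ageIds.size, docsExamined: ids.length };
}

// 4. Compound index: the key range contains exactly the matching documents.
function compoundIndex(q) {
  var ids = byNameAge.get(q.name + "|" + q.age) || [];
  return { matches: ids.length, keysExamined: ids.length, docsExamined: ids.length };
}

var query = { name: "name3", age: 35 };
console.log("scan        ", collectionScan(query));    // 0 keys, 1000 docs examined
console.log("single index", singleIndex(query));       // 100 keys, 100 docs examined
console.log("intersection", indexIntersection(query)); // 200 keys, 10 docs examined
console.log("compound    ", compoundIndex(query));     // 10 keys, 10 docs examined
```

All four strategies return the same 10 documents; what changes is the amount of work needed to find them, which is what ultimately drives the differences you measured.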

Another aspect to consider: index updates come at a price, since every write has to maintain each index on the collection. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.

And last but not least (!) the well over-used bottom line advice: Profile the hell out of your system using real data and perhaps even realistic load scenarios. And also keep measuring as your data/system changes over time.

Additional reads: https://docs.mongodb.com/manual/core/query-optimization/index.html

https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in

Index intersection vs. compound index?

mongodb compund index vs. index intersect

How does the order of compound indexes matter in MongoDB performance-wise?

In MongoDB, I am using a large query, how I will create compound index or single index, So My response time boost up