25
votes

I'm working on a solution to store application logs in Elasticsearch for many applications across many development teams. Every log entry has the same structure, with an "app" field that identifies the application.

The #1 goal is to support efficient querying within a single "app". Querying across all apps, while still important, would be secondary.

I'm trying to determine what is best:

EDIT: in both cases I will use time-based indexes.

multiple index series

Each "app" would have its own series of time-based indexes (app1-2017-04-01, app1-2017-04-02, etc.). Users would search directly against these smaller indexes. The thought here is that since the indexes are smaller, querying them might be faster.

single index series

Use one giant index series to hold all application logs (e.g. logs-2017-04-01, logs-2017-04-02, etc.). Users would filter on the "app" field to narrow their search results.

Which is faster in this case? I'm also curious about the overhead cost of the additional indexes.
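To make the two layouts concrete, here is a minimal sketch of the search paths each one would produce for a three-day window. The helper names (`daily_index_names`, `search_path`) are hypothetical; the only Elasticsearch behaviour assumed is that a search request can target several comma-separated index names.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, end):
    """Return one index name per day in [start, end], e.g. 'app1-2017-04-01'."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]


def search_path(index_names):
    """Build a search path that targets several indexes at once."""
    return "/" + ",".join(index_names) + "/_search"


start, end = date(2017, 4, 1), date(2017, 4, 3)

# Multiple index series: the app is selected by the index name itself.
per_app_path = search_path(daily_index_names("app1", start, end))

# Single index series: every app shares the index, so the query body
# would additionally have to filter on the "app" field.
shared_path = search_path(daily_index_names("logs", start, end))

print(per_app_path)
print(shared_path)
```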

4
When dealing with logging, you really want to use time-based indexes. The way I'd do it would be to have each app write to its own monthly index (appname__2017-06, anotherapp__2017-06, etc). This way when the time comes to delete old data, you can drop the entire month instead of running expensive and slow delete queries. - Ivan
@Johnny currently I am writing daily time-based indexes per app. My question is whether I should use one giant time-based index or many smaller time-based indexes. - bradforj287
You are approaching this from the wrong angle. You should ask yourself a different set of questions. What's the retention period? How many apps are there? How much data per app (in GB) do you predict you'll have per day? How many ES nodes? What are the hardware resources of these nodes? Are you primarily querying the latest period's indices, or do older indices receive an equal share of queries? - Andrei Stefan

4 Answers

34
votes

In most cases multiple indexes are better:

  1. Searching a smaller dataset is faster.
  2. You are less constrained in your mapping structure. If you need to change it for new data, you can keep the old data without reindexing and just apply the new mapping to the new index.
  3. It's more scalable and flexible. You can keep old indexes on a different hard drive or a different machine.
  4. You can still search across multiple indexes if required.
  5. The per-index overhead is small. If you have lots of documents per index, the documents take up much more space than the index metadata. If not, you can split your log indexes by a longer time period so that each index holds more documents.
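As an illustration of point 4, index names in a search request can use wildcards, so a per-app naming scheme still allows searching one app across all days or all apps for one day. The sketch below only imitates that name resolution locally with Python's `fnmatch`; the index list is made up, and the assumption is that the cluster resolves simple `*` patterns the same way.

```python
from fnmatch import fnmatchcase

# Hypothetical indexes present in the cluster.
indices = [
    "app1-2017-04-01",
    "app1-2017-04-02",
    "app2-2017-04-01",
    "app2-2017-04-02",
]


def matching(pattern, names):
    """Return the index names that a wildcard pattern would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


one_app = matching("app1-*", indices)        # e.g. GET /app1-*/_search
one_day = matching("*-2017-04-01", indices)  # e.g. GET /*-2017-04-01/_search
everything = matching("*-2017-04-*", indices)

print(one_app)
print(one_day)
print(len(everything))
```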
8
votes

I will provide hypothetical guidance, since you chose to ignore answers to my questions.

When it comes to a logging use case (time-based indices), it is imperative to have some data about future plans at hand: how long you want to keep the logging data around (retention period), what the usage pattern for the collected data will be (query frequency, indexing frequency), and how much data there will be each day (meaning data on disk, i.e. shard size). Before thinking about "per-app index" versus "single index", consider the advice below. Once you have done the math on the shard sizes and on how many shards there will be for the chosen retention period, you can decide between per-app indices and a single index.

Depending first on the shard sizes and second on the retention period, you need to decide whether the time-based indices will be daily, weekly or monthly. A good rule of thumb is a maximum shard size of 30-50GB; anything above this can make recovery, shard relocation and searching slower, and can affect cluster stability.
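The sizing rule of thumb above boils down to simple arithmetic. A rough sketch, with made-up data volumes and a hypothetical helper name, of how many primary shards an index would need to stay under the chosen cap:

```python
import math


def primary_shards_needed(index_size_gb, max_shard_gb=30.0):
    """Smallest number of primary shards that keeps each shard at or below the cap."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


# Hypothetical volumes: a daily index of 20 GB and a weekly index of 120 GB.
print(primary_shards_needed(20))        # conservative 30 GB cap
print(primary_shards_needed(120))       # conservative 30 GB cap
print(primary_shards_needed(120, 50))   # looser 50 GB cap
```

If the daily volume of an app is far below the cap, a weekly or monthly index for that app keeps the shard count down without producing oversized shards.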

If your apps can generate more data per day than the size mentioned above, you shouldn't create indices per application. If the sizes are smaller, then again, it depends. A huge number of shards on one node wastes resources and makes searching slow. Each shard uses a fixed amount of memory just because it exists. Also, when performing searches, each shard runs its search on one thread, and one thread basically occupies one CPU core. The longer the time span used in searches (the more indices being searched) and the higher the number of concurrent searches, the more context switching there is at the OS level between the threads competing for CPU cores. All in all, don't try to squeeze hundreds of shards onto a single node, unless only some of them will be used at any given time. If you plan on querying all the data in your cluster most of the time, the number of shards you'd want on each node shrinks drastically. Otherwise your cluster will not be able to keep up with the load.
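To see how quickly the shard count per node grows, here is a back-of-the-envelope sketch. All numbers (50 apps, 30 days of retention, 6 nodes, one replica) are invented for illustration, and the function name is hypothetical.

```python
def shards_per_node(index_series, retention_days, primaries_per_index, replicas, nodes):
    """Average number of shards (primaries plus replicas) each node must hold
    when every index series creates one index per day."""
    total_shards = index_series * retention_days * primaries_per_index * (1 + replicas)
    return total_shards / nodes


# Per-app daily indices: 50 series, 1 primary shard each.
per_app_layout = shards_per_node(
    index_series=50, retention_days=30, primaries_per_index=1, replicas=1, nodes=6
)

# One shared daily index with 5 primary shards.
single_layout = shards_per_node(
    index_series=1, retention_days=30, primaries_per_index=5, replicas=1, nodes=6
)

print(per_app_layout)
print(single_layout)
```

With these assumed inputs the per-app layout lands in the "hundreds of shards per node" range the answer warns about, which is the kind of result that would argue for weekly or monthly per-app indices, or for a shared index.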

If your logging use case mostly has high activity on the most recent data (the last few days to one week), then consider a hot-warm architecture: https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

Building and configuring a cluster always involves testing. So please test the performance of your queries on a sample of data that is as close as possible to real-life data, and do this on one node that has the hardware specs of the nodes in the production cluster.

6
votes

In terms of performance, it is better to use one large index than several small indices, as you can see in the article Index vs. Type by Adrien Grand:

An index is stored in a set of shards, which are themselves Lucene indices. This already gives you a glimpse of the limits of using a new index all the time: Lucene indices have a small yet fixed overhead in terms of disk space, memory usage and file descriptors used. For that reason, a single large index is more efficient than several small indices: the fixed cost of the Lucene index is better amortized across many documents.

My suggestion is to use one time-based index series for all applications, with each application as a different type in the index. This makes it easy to search a single application's logs and straightforward to search all applications at the same time.

For example:

If you want to search in one app only you can use:

http://yourserver:9200/logs-2017-04-01/app1/_search

And for all applications:

http://yourserver:9200/logs-2017-04-01/_search

Another point to consider is that each application can produce a different number of log entries. With a separate index for each app, it becomes difficult to size the shards for each one. Using a single index makes it easier to size your cluster: if the index is too large, just split it into more shards.
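An alternative to putting the type in the URL is to filter on the "app" field from the question in the query body, which also works on newer Elasticsearch versions where mapping types were removed. A minimal sketch of such a request body follows; the `message` field name is an assumption (the question does not list the log fields), and the `term` filter assumes "app" is indexed as an exact-value (keyword) field.

```python
import json


def app_filter_query(app, text):
    """Build a search body that matches `text` and restricts results to one app."""
    return {
        "query": {
            "bool": {
                # Full-text part of the search; "message" is a hypothetical field.
                "must": [{"match": {"message": text}}],
                # Exact-match restriction to a single application.
                "filter": [{"term": {"app": app}}],
            }
        }
    }


body = app_filter_query("app1", "timeout")

# This JSON would be sent to e.g. http://yourserver:9200/logs-2017-04-01/_search
print(json.dumps(body, indent=2))
```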

5
votes

Keeping different indexes for different apps gives you flexibility and can ultimately help you improve performance by tuning the number of shards/replicas for each app. In any case, you can always allow cross-app searches by defining aliases or simply by using wildcards.
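As a sketch of the alias approach, the body below is what one could POST to the `_aliases` endpoint so that a single name covers several per-app index series. The alias name `all-logs` and the app prefixes are hypothetical.

```python
import json


def alias_actions(alias, index_patterns):
    """Build an _aliases request body that adds `alias` to each index pattern."""
    return {
        "actions": [
            {"add": {"index": pattern, "alias": alias}}
            for pattern in index_patterns
        ]
    }


body = alias_actions("all-logs", ["app1-*", "app2-*"])

# Sent as: POST http://yourserver:9200/_aliases
print(json.dumps(body, indent=2))
```

After this, searching `all-logs` would hit the indexes of both apps. As far as I know, the patterns are resolved against the indexes that exist when the request runs, so newly created daily indexes would typically pick up the alias through an index template instead.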

Considering that multiple teams will access the data, keeping different indexes for different apps is also clearer. Finally, if you eventually want to add some sort of access control (using Shield/X-Pack), having different indexes will definitely make things easier.