
I'm building a fairly large index, around 3 billion documents of 2KB average size, with nothing fancy like parent/child relationships. At first the bulk indexing ran fine, but it is now slowing down drastically.

Not sure if hardware information is important for the question, but here it is:

The cluster currently sits on a single server with 24 cores, 128GB of RAM, and a RAID 10 array of 7,200 RPM disks behind a hardware controller with a BBU. Unfortunately most of the RAM (around 80GB) is occupied by other daemons.

Here's what's important (at least I think so):

I'm providing my own IDs. I've already read Choosing a fast unique identifier (UUID) for Lucene and it all seems logical to me. My IDs are 64-bit integers and will eventually be sequential, but for various reasons the initial indexing is done in bulk batches with completely random IDs.

At first I was indexing around 3,000 documents per second (the bottleneck was not ES but the databases the documents are pulled from). Currently the server is almost stalling on IO (99% reads) because of the constant ID lookups. I have already indexed around 60% of the documents, which took roughly two weeks.

When this initial indexing finishes, I will only index with sequential IDs, at a rate of around 100 docs/s. The main question here is: will the performance then be as good as if the whole index had been built with sequential IDs? Because if the answer is no, I will abandon the current index and create a new one that uses the default auto-generated _id of ES and keeps my IDs in a separate field. This would require some changes in the application, but my documents are very rarely updated, so it shouldn't be hard.
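
For illustration, the auto-generated-_id variant could look roughly like the sketch below. This is only a sketch, assuming the official elasticsearch-php client is available as $client; $docs and the my_id field are hypothetical names, not part of my current setup:

    $params = ['body' => []];
    foreach ($docs as $doc) {
        // no '_id' in the action metadata, so ES auto-generates one
        $params['body'][] = ['index' => ['_index' => 'articles', '_type' => 'article']];
        // keep the application's 64-bit ID as an ordinary field instead
        $doc['my_id'] = $doc['id'];
        $params['body'][] = $doc;
    }
    $client->bulk($params);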

=== Edit ===

I'm adding a bit more information about my setup:

Number of shards: 6
ES_HEAP_SIZE: 32g
Mapping (as a PHP array; filters, analyzers, and tokenizers excluded for brevity):

    $params = [
        'index' => 'articles',
        'body' => [
            'settings' => [
                'number_of_shards' => 6,
                'number_of_replicas' => 0,
                'refresh_interval' => -1,
                'analysis' => [
                    'filter' => [
                    ],
                    'analyzer' => [
                    ],
                    'tokenizer' => [
                    ]
                ]
            ],
            'mappings' => [
                'article' => [
                    '_source' => ['enabled' => false],
                    '_all' => ['enabled' => false],
                    '_analyzer' => ['path' => 'lang_analyzer'],
                    'properties' => [
                        'lang_analyzer' => [
                            'type' => 'string',
                            'doc_values' => true,
                            'store' => false,
                            'index' => 'no'
                        ],
                        'date' => [
                            'type' => 'date',
                            'doc_values' => true
                        ],
                        'feed_id' => [
                            'type' => 'integer'
                        ],
                        'feed_subscribers' => [
                            'type' => 'integer'
                        ],
                        'feed_canonical' => [
                            'type' => 'boolean'
                        ],
                        'title' => [
                            'type' => 'string',
                            'store' => false
                        ],
                        'content' => [
                            'type' => 'string',
                            'store' => false
                        ],
                        'url' => [
                            'type' => 'string',
                            'analyzer' => 'simple',
                            'store' => false
                        ]
                    ]
                ]
            ]
        ]
    ];
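
For reference, an array of this shape is what the index-creation call of the official PHP client expects, along the lines of (assuming the client instance is named $client):

    $client->indices()->create($params);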

Config (elasticsearch.yml):

    node.master: true
    node.data: true
    plugin.mandatory: analysis-kuromoji,analysis-icu,langdetect-1.4.0.2-1368fbe,analysis-smartcn
    bootstrap.mlockall: true
    action.disable_delete_all_indices: true
    index.merge.scheduler.max_thread_count: 1
    indices.memory.index_buffer_size: 3gb
    index.translog.flush_threshold_size: 1gb
    index.store.throttle.type: none
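
Since refresh_interval is set to -1 above, here is a sketch of re-enabling periodic refresh once the initial bulk load finishes (again assuming the official PHP client as $client; 1s is just the usual default value):

    // restore a periodic refresh so newly indexed documents become searchable
    $client->indices()->putSettings([
        'index' => 'articles',
        'body' => ['index' => ['refresh_interval' => '1s']],
    ]);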

I have removed the other services from the host, so all 128GB of RAM are now available to ES. There is no longer any read IO during indexing, because the index files fit in the OS filesystem cache.

I am essentially indexing documents whose auto-increment IDs come from MySQL, up to a given ID that I have written down. The documents are not indexed in sequential order but in random order across the whole ID range. There are no duplicate requests (updates) during this indexing phase.

The main question still remains:

When I finish bulk indexing all IDs up to my threshold and then start indexing new documents only sequentially, will the indexing performance be the same as if the whole index had been built with sequential IDs?

Comment from nefo_x: did you have any success with the performance of this?

2 Answers


My guess is that the slowdown is not related to providing your own _id field. I suggest watching this video on configuring ES for production; it covers many settings that need to be tuned, the most important one being giving 50% of the machine's memory to the JVM heap. This was critical for us in production.

http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/

Of course, you should also run more than one node, on more than one machine, in production. ES recommends a minimum of three nodes.

Another consideration is that 3B records in a single index is quite large. You will probably get better performance out of rolling indices (for instance, one index per 30 days) and then use aliasing to combine all the rolling indices into a single queryable index when you need it.
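
As a rough sketch of the rolling-index idea (assuming the official PHP client as $client; the index and alias names are illustrative, not from your setup):

    // group the monthly indices under one alias and query through it
    $client->indices()->updateAliases([
        'body' => [
            'actions' => [
                ['add' => ['index' => 'articles_2014_11', 'alias' => 'articles_all']],
                ['add' => ['index' => 'articles_2014_12', 'alias' => 'articles_all']],
            ],
        ],
    ]);
    // searches against the alias fan out to all indices behind it
    $results = $client->search(['index' => 'articles_all']);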

Good luck!


As far as I can see, your case looks like time-series data, and we're talking about roughly 6TB of it (3 billion documents at 2KB each).

Number of shards

Do not over-shard your data. Make it two servers with fewer CPU cores each and create two shards with one replica, so that you have some redundancy if one server fails.
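
A sketch of what that layout could look like at index-creation time (assuming the official PHP client as $client; the values follow the suggestion above):

    $client->indices()->create([
        'index' => 'articles',
        'body' => [
            'settings' => [
                'number_of_shards'   => 2, // one primary shard per server
                'number_of_replicas' => 1, // each primary gets a copy on the other server
            ],
        ],
    ]);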

Time series use-case

I assume that the most frequently accessed data is from the latest month or two. Let's say you have an alias named "events" that points to all indices named events_2014_12, events_2014_11, events_2014_10, events_2013, events_2012, and so on. The bigger an index is, the longer it takes to add new documents or to search through it. With month-based indices I don't think any single index will grow beyond 100-300GB. You can read about it here.
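
At the start of each month, the pattern would roughly be to create the new monthly index and attach it to the alias, so searches through "events" keep covering everything (a sketch assuming the official PHP client as $client; names follow the example above):

    // create next month's index and make it searchable via the alias
    $client->indices()->create(['index' => 'events_2015_01']);
    $client->indices()->updateAliases([
        'body' => [
            'actions' => [
                ['add' => ['index' => 'events_2015_01', 'alias' => 'events']],
            ],
        ],
    ]);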

_id generation

Elasticsearch already ensures that its auto-generated identifiers are evenly distributed. It only makes sense to provide pre-generated identifiers if the source of the data lives in another storage system.

Other

If you're up for digging deep into this technology, I can recommend the blog of a company that offers Elasticsearch as a service.