I'm building a fairly large index: around 3 billion documents, 2 KB average size, nothing fancy like parent/child relationships. At first the bulk indexing ran fine, but it has now slowed down drastically.
Not sure if hardware information is important for the question, but here it is:
The cluster currently sits on a single server with 24 cores, 128 GB of RAM, and a RAID 10 array of 7200 RPM disks behind a hardware controller with a BBU. Unfortunately, most of the RAM (around 80 GB) is occupied by other daemons.
Here's what's important (at least I think so):
I'm providing my own IDs. I've already read Choosing a fast unique identifier (UUID) for Lucene and it all seems logical to me. My IDs are 64-bit integers and will eventually be sequential, but for various reasons the initial indexing is done in bulk with completely random IDs.
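For clarity, a bulk request with explicit IDs looks roughly like this (simplified sketch using the official elasticsearch-php client; $docs and $client are placeholders for my real code):

// Simplified sketch: an action/metadata line carrying my own 64-bit _id,
// followed by the document source, for every document in the batch.
$bulkBody = [];
foreach ($docs as $id => $doc) {
    $bulkBody[] = ['index' => ['_index' => 'articles', '_type' => 'article', '_id' => $id]];
    $bulkBody[] = $doc;
}
$client->bulk(['body' => $bulkBody]);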
At first I was indexing around 3,000 documents per second (the bottleneck was not ES but the databases the documents are pulled from). Currently the server is almost stalled by I/O (99% reads) because of the constant ID lookups. I have indexed around 60% of the documents so far, which took roughly two weeks.
When this initial indexing finishes, I will provide only sequential IDs, at a rate of around 100 docs/s. The main question here is: will the performance then be as good as if the whole index had been built with sequential IDs? Because if the answer is no, I will abandon the current index and create a new one that uses ES's auto-generated _id, storing my own IDs in a separate field. This will require some changes in the application, but my documents are very rarely updated, so it shouldn't be hard.
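If I do switch, the indexing side would simply stop sending an _id and keep my identifier as an ordinary field, roughly like this (sketch; the field name my_id is illustrative and would have to be added to the mapping):

// No '_id' in the action metadata, so ES auto-generates its own ID;
// my 64-bit identifier is stored as a regular numeric field instead.
$bulkBody[] = ['index' => ['_index' => 'articles', '_type' => 'article']];
$bulkBody[] = ['my_id' => $id] + $doc;   // 'my_id' is a hypothetical field name
$client->bulk(['body' => $bulkBody]);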
=== Edit ===
I'm adding a bit more information about my setup:
Number of shards: 6
ES_HEAP_SIZE: 32g
Mapping (a PHP array; the filter, analyzer, and tokenizer definitions are excluded for brevity):
'index' => 'articles',
'body' => [
    'settings' => [
        'number_of_shards' => 6,
        'number_of_replicas' => 0,
        'refresh_interval' => -1,
        'analysis' => [
            'filter' => [],
            'analyzer' => [],
            'tokenizer' => []
        ]
    ],
    'mappings' => [
        'article' => [
            '_source' => ['enabled' => false],
            '_all' => ['enabled' => false],
            '_analyzer' => ['path' => 'lang_analyzer'],
            'properties' => [
                'lang_analyzer' => [
                    'type' => 'string',
                    'doc_values' => true,
                    'store' => false,
                    'index' => 'no'
                ],
                'date' => [
                    'type' => 'date',
                    'doc_values' => true
                ],
                'feed_id' => [
                    'type' => 'integer'
                ],
                'feed_subscribers' => [
                    'type' => 'integer'
                ],
                'feed_canonical' => [
                    'type' => 'boolean'
                ],
                'title' => [
                    'type' => 'string',
                    'store' => false
                ],
                'content' => [
                    'type' => 'string',
                    'store' => false
                ],
                'url' => [
                    'type' => 'string',
                    'analyzer' => 'simple',
                    'store' => false
                ]
            ]
        ]
    ]
]
Config (elasticsearch.yml):
node.master: true
node.data: true
plugin.mandatory: analysis-kuromoji,analysis-icu,langdetect-1.4.0.2-1368fbe,analysis-smartcn
bootstrap.mlockall: true
action.disable_delete_all_indices: true
index.merge.scheduler.max_thread_count: 1
indices.memory.index_buffer_size: 3gb
index.translog.flush_threshold_size: 1gb
index.store.throttle.type: none
I have removed the other services from the host, so all 128 GB of memory are now available to ES. Indexing no longer generates any read I/O, because the index files are cached by the OS.
I am basically indexing documents whose auto-increment IDs in MySQL are below a threshold ID that I have written down. The documents are not indexed in sequential order, but in effectively random order across the whole ID range. There are no duplicate requests (updates) during this indexing phase.
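Conceptually the indexing loop looks like the sketch below (simplified; fetchNextBatch() stands in for my real MySQL query code and returns rows in effectively random ID order):

// Pull a batch of rows with id <= $threshold from MySQL and send one bulk
// request per batch; fetchNextBatch() is a hypothetical helper.
while ($rows = fetchNextBatch($pdo, $threshold, 1000)) {
    $bulkBody = [];
    foreach ($rows as $row) {
        $bulkBody[] = ['index' => ['_index' => 'articles', '_type' => 'article', '_id' => $row['id']]];
        $bulkBody[] = [
            'date'    => $row['date'],
            'feed_id' => (int) $row['feed_id'],
            'title'   => $row['title'],
            'content' => $row['content'],
            'url'     => $row['url'],
        ];
    }
    $client->bulk(['body' => $bulkBody]);
}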
The main question still remains:
Once I finish bulk indexing all IDs up to my threshold and then start indexing new documents only sequentially, will the indexing performance be the same as if the whole index had been built with sequential IDs?