
I'm building a fairly large index, around 3 billion documents of 2KB average size, with nothing fancy like parent/child relationships. At first the bulk indexing ran fine, but it is now slowing down drastically.

Not sure if hardware information is important for the question, but here it is:

The cluster currently sits on a single server with 24 cores, 128GB of RAM, and a RAID 10 array of 7,200 RPM disks behind a hardware controller with a BBU. Unfortunately most of the RAM (around 80GB) is occupied by other daemons.

Here's what's important (at least I think so):

I'm providing my own IDs. I've already read Choosing a fast unique identifier (UUID) for Lucene and it all seems logical to me. My IDs are 64-bit integers and will eventually be sequential, but for various reasons the initial indexing is done in bulk batches with completely random IDs.

At first I was indexing around 3,000 documents per second (the bottleneck was not ES but the databases the documents are pulled from). Currently the server is almost stalling on IO (99% reads) because of the constant ID lookups. I have already indexed around 60% of the documents, which took roughly two weeks.

When this initial indexing finishes, I will only index with sequential IDs, at a rate of around 100 docs/s. The main question here is: will the performance then be as good as if the whole index had been built with sequential IDs? Because if the answer is no, I will abandon the current index and create a new one that uses the default auto-generated _id of ES and keeps my IDs in a separate field. This would require some changes in the application, but my documents are very rarely updated, so it shouldn't be hard.
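
For illustration, the auto-generated-_id variant could look roughly like the sketch below. This is only a sketch, assuming the official elasticsearch-php client is available as $client; $docs and the my_id field are hypothetical names, not part of my current setup:

    $params = ['body' => []];
    foreach ($docs as $doc) {
        // no '_id' in the action metadata, so ES auto-generates one
        $params['body'][] = ['index' => ['_index' => 'articles', '_type' => 'article']];
        // keep the application's 64-bit ID as an ordinary field instead
        $doc['my_id'] = $doc['id'];
        $params['body'][] = $doc;
    }
    $client->bulk($params);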

=== Edit ===

I'm adding a bit more information about my setup:

Number of shards: 6
ES_HEAP_SIZE: 32g
Mapping (as a PHP array; filters, analyzers, and tokenizers excluded for brevity):

    $params = [
        'index' => 'articles',
        'body' => [
            'settings' => [
                'number_of_shards' => 6,
                'number_of_replicas' => 0,
                'refresh_interval' => -1,
                'analysis' => [
                    'filter' => [
                    ],
                    'analyzer' => [
                    ],
                    'tokenizer' => [
                    ]
                ]
            ],
            'mappings' => [
                'article' => [
                    '_source' => ['enabled' => false],
                    '_all' => ['enabled' => false],
                    '_analyzer' => ['path' => 'lang_analyzer'],
                    'properties' => [
                        'lang_analyzer' => [
                            'type' => 'string',
                            'doc_values' => true,
                            'store' => false,
                            'index' => 'no'
                        ],
                        'date' => [
                            'type' => 'date',
                            'doc_values' => true
                        ],
                        'feed_id' => [
                            'type' => 'integer'
                        ],
                        'feed_subscribers' => [
                            'type' => 'integer'
                        ],
                        'feed_canonical' => [
                            'type' => 'boolean'
                        ],
                        'title' => [
                            'type' => 'string',
                            'store' => false
                        ],
                        'content' => [
                            'type' => 'string',
                            'store' => false
                        ],
                        'url' => [
                            'type' => 'string',
                            'analyzer' => 'simple',
                            'store' => false
                        ]
                    ]
                ]
            ]
        ]
    ];
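
For reference, an array of this shape is what the index-creation call of the official PHP client expects, along the lines of (assuming the client instance is named $client):

    $client->indices()->create($params);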

Config (elasticsearch.yml):

    node.master: true
    node.data: true
    plugin.mandatory: analysis-kuromoji,analysis-icu,langdetect-1.4.0.2-1368fbe,analysis-smartcn
    bootstrap.mlockall: true
    action.disable_delete_all_indices: true
    index.merge.scheduler.max_thread_count: 1
    indices.memory.index_buffer_size: 3gb
    index.translog.flush_threshold_size: 1gb
    index.store.throttle.type: none
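
Since refresh_interval is set to -1 above, here is a sketch of re-enabling periodic refresh once the initial bulk load finishes (again assuming the official PHP client as $client; 1s is just the usual default value):

    // restore a periodic refresh so newly indexed documents become searchable
    $client->indices()->putSettings([
        'index' => 'articles',
        'body' => ['index' => ['refresh_interval' => '1s']],
    ]);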

I have removed the other services from the host, so all 128GB of RAM are now available to ES. There is no longer any read IO during indexing, because the index files fit in the OS filesystem cache.

I am essentially indexing documents whose auto-increment IDs come from MySQL, up to a given ID that I have written down. The documents are not indexed in sequential order but in random order across the whole ID range. There are no duplicate requests (updates) during this indexing phase.

The main question still remains:

When I finish bulk indexing all IDs up to my threshold and then start indexing new documents only sequentially, will the indexing performance be the same as if the whole index had been built with sequential IDs?

Comment from nefo_x: did you have any success with the performance of this?

2 Answers


My guess is that the slowdown is not related to providing your own _id field. I suggest watching this video on configuring ES for production; it covers many settings that need to be tuned, the most important one being giving 50% of the machine's memory to the JVM heap. This was critical for us in production.

http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/

Of course, you should also run more than one node, on more than one machine, in production. ES recommends a minimum of three nodes.

Another consideration is that 3B records in a single index is quite large. You will probably get better performance out of rolling indices (for instance, one index per 30 days) and then use aliasing to combine all the rolling indices into a single queryable index when you need it.
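
As a rough sketch of the rolling-index idea (assuming the official PHP client as $client; the index and alias names are illustrative, not from your setup):

    // group the monthly indices under one alias and query through it
    $client->indices()->updateAliases([
        'body' => [
            'actions' => [
                ['add' => ['index' => 'articles_2014_11', 'alias' => 'articles_all']],
                ['add' => ['index' => 'articles_2014_12', 'alias' => 'articles_all']],
            ],
        ],
    ]);
    // searches against the alias fan out to all indices behind it
    $results = $client->search(['index' => 'articles_all']);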

Good luck!


As far as I can see, your case looks like time-series data, and we're talking about roughly 6TB of it (3 billion documents at 2KB each).

Number of shards

Do not over-shard your data. Make it two servers with fewer CPU cores each and create two shards with one replica, so that you have some redundancy if one server fails.
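
A sketch of what that layout could look like at index-creation time (assuming the official PHP client as $client; the values follow the suggestion above):

    $client->indices()->create([
        'index' => 'articles',
        'body' => [
            'settings' => [
                'number_of_shards'   => 2, // one primary shard per server
                'number_of_replicas' => 1, // each primary gets a copy on the other server
            ],
        ],
    ]);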

Time series use-case

I assume that the most frequently accessed data is from the latest month or two. Let's say you have an alias named "events" that points to all indices named events_2014_12, events_2014_11, events_2014_10, events_2013, events_2012, and so on. The bigger an index is, the longer it takes to add new documents or to search through it. With month-based indices I don't think any single index will grow beyond 100-300GB. You can read about it here.
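
At the start of each month, the pattern would roughly be to create the new monthly index and attach it to the alias, so searches through "events" keep covering everything (a sketch assuming the official PHP client as $client; names follow the example above):

    // create next month's index and make it searchable via the alias
    $client->indices()->create(['index' => 'events_2015_01']);
    $client->indices()->updateAliases([
        'body' => [
            'actions' => [
                ['add' => ['index' => 'events_2015_01', 'alias' => 'events']],
            ],
        ],
    ]);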

_id generation

Elasticsearch already ensures that its auto-generated identifiers are evenly distributed. It only makes sense to provide pre-generated identifiers if the source of the data lives in another storage system.

Other

If you're up for digging deep into this technology, I can recommend the blog of a company that offers Elasticsearch as a service.