If it's a small dataset (e.g. 1K records), you can simply specify size:
curl 'localhost:9200/foo_index/_search?size=1000'
A match_all query isn't needed, as it's implicit.
If you have a medium-sized dataset, like 1M records, you may not have enough memory to load it, so you need a scroll.
A scroll is like a cursor in a DB. In Elasticsearch, it remembers where you left off and keeps the same view of the index (i.e. it prevents the searcher from going away with a refresh, and it keeps segments that have since been merged from being deleted while the scroll still uses them).
API-wise, you have to add a scroll parameter to the first request:
curl 'localhost:9200/foo_index/_search?size=100&scroll=1m&pretty'
You get back the first page and a scroll ID:
{
"_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAADEWbmJlSmxjb2hSU0tMZk12aEx2c0EzUQ==",
"took" : 0,
...
Remember that both the scroll ID you get back and the timeout are valid for the next page. A common mistake here is to specify a very large timeout (the value of scroll) that would cover processing the whole dataset (e.g. 1M records) instead of one page (e.g. 100 records).
To get the next page, fill in the last scroll ID and a timeout that should last until fetching the following page:
curl -XPOST -H 'Content-Type: application/json' 'localhost:9200/_search/scroll' -d '{
"scroll": "1m",
"scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAADAWbmJlSmxjb2hSU0tMZk12aEx2c0EzUQ=="
}'
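The two requests above can be put in a loop that stops at the first empty page. Here is a minimal Python sketch of that idea, using only the standard library. The names scroll_all and http_post, the ES_URL constant and the injectable post argument are made up for illustration; only the URLs and JSON fields come from the curl examples.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, as in the curl examples


def http_post(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scroll_all(index, page_size=100, keep_alive="1m", post=http_post):
    """Yield every hit in `index`, fetching one page per request."""
    # First request: opens the scroll and returns the first page.
    page = post(
        f"{ES_URL}/{index}/_search?scroll={keep_alive}&size={page_size}",
        {"query": {"match_all": {}}},
    )
    # An empty page means the scroll is exhausted.
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Always send the most recent scroll ID; keep_alive only has to
        # cover the time until the next page is requested.
        page = post(
            f"{ES_URL}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
```

You would consume it with something like for hit in scroll_all("foo_index"): ..., doing the per-document work inside the loop. Because the timeout is renewed with every page, the slow part to watch is how long you spend on one page, not on the whole export.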
If you have a lot to export (e.g. 1B documents), you'll want to parallelise. This can be done via sliced scroll. Say you want to export on 10 threads. The first thread would issue a request like this:
curl -XPOST -H 'Content-Type: application/json' 'localhost:9200/test/_search?scroll=1m&size=100' -d '{
"slice": {
"id": 0,
"max": 10
}
}'
You get back the first page and a scroll ID, exactly like a normal scroll request. You'd consume it exactly like a regular scroll, except that you get 1/10th of the data.
Other threads would do the same, except that id would be 1, 2, 3...
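As a rough illustration of the threading part, here is a Python sketch that runs one scroll per slice in a thread pool. It assumes a local node as in the curl examples; export_slice, export_parallel, http_post and the handle_hit callback are hypothetical names, and handle_hit has to be safe to call from several threads at once.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ES_URL = "http://localhost:9200"  # assumption: local node, as in the curl examples


def http_post(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def export_slice(index, slice_id, max_slices, handle_hit, post=http_post,
                 keep_alive="1m", page_size=100):
    """Scroll through one slice and return how many hits it contained."""
    page = post(
        f"{ES_URL}/{index}/_search?scroll={keep_alive}&size={page_size}",
        {"slice": {"id": slice_id, "max": max_slices}},
    )
    count = 0
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            handle_hit(hit)
            count += 1
        # Each slice has its own scroll ID and is consumed like a regular scroll.
        page = post(
            f"{ES_URL}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
    return count


def export_parallel(index, handle_hit, max_slices=10, post=http_post):
    """Run one scroll per slice (ids 0 .. max_slices - 1), one thread each."""
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        futures = [
            pool.submit(export_slice, index, slice_id, max_slices, handle_hit, post)
            for slice_id in range(max_slices)
        ]
        # result() re-raises any exception from the worker thread.
        return sum(future.result() for future in futures)
```

Calling export_parallel("test", print) would print every hit and return the total number exported. Threads are enough here because each worker mostly waits on the network; if the per-document work is CPU-heavy, separate processes may be a better fit.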
size query parameter are not correct. Irrespective of the value of size in the query, ES will return at most index.max_result_window docs (which defaults to 10k) in the response. Refer to scroll and search_after. – narendra-choudhary