10 votes

I have an Elasticsearch index, my_index, with millions of documents keyed by my_uuid. On top of that index I have several filtered aliases of the following form (showing only my_alias, as retrieved by GET my_index/_alias/my_alias):

{
    "my_index": {
        "aliases": {
            "my_alias": {
                "filter": {
                    "terms": {
                        "my_uuid": [
                            "0944581b-9bf2-49e1-9bd0-4313d2398cf6",
                            "b6327e90-86f6-42eb-8fde-772397b8e926",
                            thousands of rows...
                        ]
                    }
                }
            }
        }
    }
}

My understanding is that the filter will be cached transparently for me, without any configuration on my part. The thing is, I am experiencing very slow searches when going through the alias, which suggests that either 1. the filter is not being cached, or 2. it is badly written.
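As far as I know, the query cache statistics from the index stats API should show whether the filter is actually being reused; the hit_count and miss_count fields in the response indicate cache usage:

GET my_index/_stats/query_cache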

Indicative numbers:

GET my_index/_search -> 50ms 
GET my_alias/_search -> 8000ms

I can provide further information on cluster scale and data size if anyone considers this relevant.

I am using Elasticsearch 2.4.1. I am getting the right results; it is just the performance that concerns me.

What happens when you run the search query directly and add the filter that is applied to the alias? Does it take as long? – pratikvasa
Have you checked that my_uuid is not_analyzed? Still, thousands of terms in a filter seems quite heavyweight. If you know these uuids at index time, you could add a new field, e.g. aliases, to each doc. Then your filter would just have a single term (see the sketch after these comments). – NikoNyrh
@NikoNyrh my_uuid is not_analyzed. Indeed I know them at index time, but they are dynamically updated in bulk, so I did not want to hard-code them into the searchable documents. – yannisf
Hi @pratikvasa. I performed the test and got similar times. The thing is that the query I have to send when not using the filtered alias is around 4MB, due to the number of my_uuids, and just uploading the query takes about 6 seconds. So I guess this is not a viable solution. – yannisf
OK. By similar times you mean you are getting around 8 seconds, which includes the 6 seconds to send the query? – pratikvasa
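A minimal sketch of NikoNyrh's single-term suggestion, assuming a hypothetical not_analyzed string field named subset and a mapping type named my_type (both names are placeholders):

PUT my_index/_mapping/my_type
{
    "properties": {
        "subset": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}

PUT my_index/_alias/my_alias
{
    "filter": {
        "term": {
            "subset": "my_alias"
        }
    }
}

Each document would then be tagged with its subset value(s) at index time, and the alias filter becomes a single cheap term lookup instead of thousands of uuids.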

1 Answer

0 votes

Matching each document against a 4MB list of uuids is definitely not the way to go; try to imagine how many CPU cycles it requires. 8s is actually quite fast.

I would duplicate the subset of data in another index.

If you need to immediately reflect changes, you will have to manage the subset index by hand:

  • when you delete a uuid from the list, delete the corresponding documents
  • when you add a uuid, copy the corresponding documents over (the reindex API with a query is your friend; see the sketch at the end of this answer)
  • when you insert a document, check whether it should be added to the subset index too
  • when you delete a document, delete it in both indices

Force the document id so it is the same in both indices. Beware of refresh time if you store the uuid list in an Elasticsearch index.

If updating the subset with new uuids is not time critical, you can just run the reindex every day or every hour.
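A minimal sketch of such a reindex run, assuming a hypothetical subset index named my_subset_index (_reindex with a source query is available from Elasticsearch 2.3 onward, so it works on 2.4.1):

POST _reindex
{
    "source": {
        "index": "my_index",
        "query": {
            "terms": {
                "my_uuid": [
                    "0944581b-9bf2-49e1-9bd0-4313d2398cf6",
                    "..."
                ]
            }
        }
    },
    "dest": {
        "index": "my_subset_index"
    }
}

Note that _reindex keeps the source _id by default, which takes care of the point above about the document ids being the same in both indices.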