9
votes

I store different kinds of documents in a single index with a strict predefined mapping. All of them share a field (say, "body"), but I want it to be analyzed slightly differently at index time depending on the document kind (for example, using different token filters for specific documents) and treated the same way at search time. As far as I know, analyzers can't be specified per document.

What I have also considered:

  1. Object fields with differently analyzed subfields per document kind, so that each document has only one subfield filled (like "body.mail" or "body.html"). The problem is that I couldn't then search on the whole "body" field and have it look through all its subfields (which is needed to avoid breaking the existing application).
  2. The new incarnation of multi-fields (a "body" field with a generic analyzer and custom-analyzed "mail", "html", etc. inside it). However, I'm not sure whether it's possible to use the subfields directly while indexing and indirectly while searching (e.g., to save an object with {"mail":"smth"} so that a specific index analyzer is used, then search with "query":{"body":"smth"} so that the generic search analyzer is used).
  3. Splitting "body" into several fields with different mappings, removing them from _all, and setting copy_to to a single body field. I'm not sure, but I suspect this will add substantial index overhead due to the copying.
2
Why not index different fields such as "mail", "html" etc., have a different analyzer for each, and use a multi match query to search on all these fields? elastic.co/guide/en/elasticsearch/reference/current/… - Ita
In my opinion, these two requirements are not possible together: "search on the whole "body" field which would look through all its subfields (**to not break the existing application**)" and "analyzed slightly differently when indexed and treated the same way while searched". Something's got to give. - Andrei Stefan
@Ita Legacy reasons. There are a lot of search queries on that field already, so it'd be hard and boilerplate-prone to replace each with multi match. - Yuuri
"copy_to to a single body field" will use the analyzer of the body field, so even if you had different analyzers on the fields that have copy_to, the text that ends up inside body will be analyzed by the body field's analyzer. - Andrei Stefan

2 Answers

15
votes

As I mentioned in the comments, what you want is not possible. Your requirement, in one sentence, is: have the same data analyzed in multiple ways, but searched as a single field, because anything else would break the existing application.

             -- body.html          
             -- body.email
body field ---- body.content     --- all searched as "body"
            ...
             -- body.destination
             -- body.whatever
  • Your first option is multi-fields, which has this exact purpose in mind: having the same data analyzed in multiple ways. The problem is that you cannot search for "body" and expect ES to search body.html, body.email... Even if this were possible, you would want them to be searched with different analyzers. Again, not possible. This option requires you to change the application and search each field in a multi_match or in a query_string.

  • Your second option - the new incarnation of multi-fields - will again not work, because you cannot refer to body and have ES, in the background, match mail, content, etc.

  • Your third option - using copy_to - will not work, because copying to another field "X" means the copied data will be analyzed with X's analyzer, and this breaks your requirement of having the same data analyzed differently.

  • There could be a fourth option - "path": "just_name" from multi_fields - which at first glance should work. That is, you can have three multi-fields (email, content, html) that each have a body sub-field. Having "path": "just_name" allows you to search just for body even if body is a sub-field of multiple other fields. But this does not work either, because this type of multi-field will not accept different analyzers for the same body.
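
To make the copy_to point concrete, here is a toy model in plain Python. It is not Elasticsearch code: the two analyzer functions, the `MAPPING` dictionary and `index_document` are all hypothetical stand-ins, and the only thing the sketch assumes about ES is what is stated above, namely that the raw text is copied and then analyzed by the target field's analyzer.

```python
# Toy model of copy_to. NOT Elasticsearch code, only an illustration.

def whitespace_analyzer(text):
    """Split on whitespace, keeping case as-is."""
    return text.split()

def lowercase_analyzer(text):
    """Lowercase, then split on whitespace (stand-in for a generic analyzer)."""
    return text.lower().split()

# Hypothetical mapping: "mail" and "html" copy their raw text into "body".
MAPPING = {
    "mail": {"analyzer": whitespace_analyzer, "copy_to": "body"},
    "html": {"analyzer": whitespace_analyzer, "copy_to": "body"},
    "body": {"analyzer": lowercase_analyzer},
}

def index_document(doc):
    """Return the tokens each field would hold after indexing, in this toy model."""
    raw = {}
    for field, value in doc.items():
        raw.setdefault(field, []).append(value)
        target = MAPPING[field].get("copy_to")
        if target:
            # The raw text is copied, not the tokens produced for the source field.
            raw.setdefault(target, []).append(value)
    tokens = {}
    for field, values in raw.items():
        analyze = MAPPING[field]["analyzer"]
        tokens[field] = [token for value in values for token in analyze(value)]
    return tokens

tokens = index_document({"mail": "Hello World"})
print(tokens)
```

In this model the tokens under "body" always come from the analyzer configured for "body", whatever analyzer the source field uses, which is why copy_to cannot give you per-kind analysis on a single searchable field.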

Either way, you need to change something in your requirements, because they will not work the way you want them to.


That being said, I'm curious to see what queries you are using in your application. It would be a simple change (yes, you will need to change your app) from querying the body field to querying body.* in a multi_match.
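
As a rough sketch of how small that change could be, assuming your application builds its request bodies in one place: the helper below is hypothetical (`to_multi_match` and its parameters are my names, not part of any client library) and only constructs the JSON body; sending it is left to whatever client you already use.

```python
import json

def to_multi_match(text, field_pattern="body.*"):
    """Build a multi_match request body that searches every field matching field_pattern."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # A wildcard pattern, so body.mail, body.html, ... are all searched.
                "fields": [field_pattern],
            }
        }
    }

request_body = to_multi_match("smth")
print(json.dumps(request_body, indent=2))
```

If the existing queries all go through a single function like this, only that function needs to change, not every call site.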

And I have another solution for you: create multiple indices, one index for each analyzer of your body. For example, for mail, content and html you define three indices:

PUT /multi_fields1
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "whitespace",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
PUT /multi_fields2
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "standard",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
PUT /multi_fields3
{
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "keyword",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

You see that all of them have the same type and the same field name - body - but different index_analyzers. Then you define an alias:

POST _aliases
{
  "actions": [
    {"add": {
        "index": "multi_fields1",
        "alias": "multi"}},
    {"add": {
        "index": "multi_fields2",
        "alias": "multi"}},
    {"add": {
        "index": "multi_fields3",
        "alias": "multi"}}
  ]
}

Name your alias the same as your current index. The application doesn't need to change: it will use the same name when searching, but this name will no longer point to an index. It will point to an alias, which in turn refers to your multiple indices. What needs to change is how you index the documents, because an HTML document needs to go into the multi_fields1 index, for example, an email document needs to be indexed into the multi_fields2 index, and so on.
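
The indexing side could look roughly like the sketch below, which picks the target index from the document kind and builds a bulk payload. The kind names, the `INDEX_BY_KIND` table and `bulk_lines` are assumptions made for illustration (in particular, mapping "content" to multi_fields3 is my guess), and the `_type` of "test" simply mirrors the mappings above. Only the payload is built here; posting it to the bulk endpoint is up to your client.

```python
import json

# Hypothetical routing table: document kind -> concrete index behind the alias.
INDEX_BY_KIND = {
    "html": "multi_fields1",
    "email": "multi_fields2",
    "content": "multi_fields3",  # assumption, not stated in the answer
}

def bulk_lines(docs, doc_type="test"):
    """Build a newline-delimited bulk payload from (kind, body_text) pairs."""
    lines = []
    for kind, body in docs:
        index_name = INDEX_BY_KIND[kind]  # raises KeyError for an unknown kind
        # Action line: tells the bulk API which index receives the next document.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": doc_type}}))
        # Source line: the document itself, always using the shared "body" field.
        lines.append(json.dumps({"body": body}))
    return "\n".join(lines) + "\n"  # a bulk payload must end with a newline

payload = bulk_lines([("html", "<p>hi</p>"), ("email", "Hello there")])
print(payload, end="")
```

Every document still writes to a field called body, so searches through the alias keep working unchanged; only the choice of index differs per kind.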

Whatever solution you find/choose, your requirements need to change because the way you want it is not possible.

4
votes

I think you can use multi-fields. With multi-fields you can define analyzers (both index and search) for each sub-field, and search on the corresponding fields based on the application's requirements. In general, the index analyzer can differ from field to field, and the same goes for the search analyzer.

{
  "your_type" : {   
    "properties":{
        "body" : {
            "type" : "string",
            "index" : "analyzed",
            "index_analyzer" : "index_body_analyzer",
            "search_analyzer" : "search_body_analyzer",
            "fields" : {
                "mail" : {
                    "type" : "string",
                    "index" : "analyzed",
                    "index_analyzer" : "index_bodymail_analyzer",
                    "search_analyzer" : "search_bodymail_analyzer"
                },
                "html": {
                    "type" : "string",              
                    "index" : "analyzed",
                    "index_analyzer" : "index_bodyhtml_analyzer",
                    "search_analyzer" : "search_bodyhtml_analyzer"
                }
            }
        }
    }
  }
}