5
votes

Elasticsearch has a built-in "highlight" function which allows you to tag the matched terms in the results (more complicated than it might at first sound, since the query syntax can include near matches etc.).

I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.

Can I make it HTML-aware / HTML-safe when highlighting in this way?

I'd like the highlighting to apply to the text in the HTML document and not to any HTML markup that happens to match the search, e.g. a search for "p" should turn <p>p</p> into <p><mark>p</mark></p>.

My fields are indexed as "type: string".

The documentation says:

Encoder:

An encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape html, if you use html highlighting tags).

...but that HTML-escapes my field, which already contains HTML, breaking things further.
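To make the escaping concrete, here is a rough Python sketch of what the html encoder appears to do to a stored value. It is a toy model, not Elasticsearch's code: the function name `html_encode` and the escape table are my own assumptions, covering only the characters that show up escaped in the highlighter output (the real encoder may escape more).

```python
# Toy model of the "html" encoder (assumption: not Elasticsearch's actual code).
# The table only covers characters seen escaped in the highlighter output.
ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "/": "&#x2F;",
}


def html_encode(text: str) -> str:
    """Escape every character of `text` that has an entry in ESCAPES."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


stored = '<p class="text">TOP STORIES</p>'
print(html_encode(stored))
# &lt;p class=&quot;text&quot;&gt;TOP STORIES&lt;&#x2F;p&gt;
```

Since the field is already markup, escaping it turns every tag into visible text, which is why this encoder does not help here.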

Here are two example queries:

  1. Using the default encoder:

The highlight tags are inserted inside other tags, i.e. <p> -> <<tag1>p</tag1>>:

curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'

GIVES:
...
      "highlight" : {
        "preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
      }

...
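For what it's worth, the output above is what you would expect if the analyzer treats the raw HTML as ordinary text, so that tag names become tokens like any other word. A minimal Python sketch of that behaviour, assuming a simple word-based tokenizer (the function `naive_highlight` is hypothetical and only imitates the effect, not the real highlighter):

```python
import re


def naive_highlight(source, query, pre="<tag1>", post="</tag1>"):
    """Wrap every word-like token equal to `query`, including tag names."""
    def wrap(match):
        word = match.group()
        if word.lower() == query.lower():
            return pre + word + post
        return word

    # \w+ does not know about markup, so the "p" in <p> is a token too.
    return re.sub(r"\w+", wrap, source)


print(naive_highlight('<p class="text">TOP STORIES</p>', "p"))
# <<tag1>p</tag1> class="text">TOP STORIES</<tag1>p</tag1>>
```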
  2. Using the html encoder:

The existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> -> &lt;<tag1>p</tag1>&gt;:

curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'

GIVES:
...
      "highlight" : {
        "preview_html" : [ "&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;TOP STORIES&lt;&#x2F;<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Middle East&lt;&#x2F;<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Syria: Developments in Syria are main story in Middle East&lt;&#x2F;<tag1>p</tag1>&gt;" ]
        }
      }

...
And you want to have it like <tag1><p></tag1>? - Andrei Stefan
Also, can you provide the mapping for preview_html field? - Andrei Stefan
@AndreiStefan: I have updated the question text to answer your two questions. - Rich
I'm sorry but you didn't answer the questions. type: string is the only detail for that field? You don't have analyzers defined or similar? Also, you say not to highlight any HTML markup which has matched the search but in the example you are highlighting html tags (<p><mark>p</mark></p>), so which one is it? - Andrei Stefan
1. Yes, "type: string" is the only mapping instruction I have given ES for that column. Any analyzers will be the default ones. 2. The example is showing that the "p" in the text is highlighted with "<mark>", but the "p" in the HTML tag name <p> is not highlighted (compare to the first example query shown in full in the question). - Rich

1 Answer

7
votes

One way to achieve this is to use the html_strip char filter when analyzing the preview_html field.
This ensures that matches do not occur on HTML markup, and hence highlighting ignores the markup too, as shown in the example below.

Example:

put test
{
   "settings": {
      "index": {
         "analysis": {
            "char_filter": {
               "my_html": {
                  "type": "html_strip"
               }
            },
            "analyzer": {
               "my_html": {
                  "tokenizer": "standard",
                  "char_filter": [
                     "my_html"
                  ],
                  "type": "custom"
               }
            }
         }
      }
   }
}

put test/test/_mapping
{
   "properties": {
      "preview_html": {
         "type": "string",
         "analyzer": "my_html",
         "search_analyzer": "standard"
      }
   }
}

put test/test/1
{
    "preview_html": "<p> p </p>"
}

post test/test/_search
{
   "query": {
      "match": {
         "preview_html": "p"
      }
   },
   "highlight": {
      "fields": {
         "preview_html": {}
      }
   }
}

Results

 "hits": [
         {
            "_index": "test",
            "_type": "test",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "preview_html": "<p> p </p>"
            },
            "highlight": {
               "preview_html": [
                  "<p> <em>p</em> </p>"
               ]
            }
         }
      ]
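The reason the highlight lands in the right place in the original HTML is that the char filter removes markup before tokenization, while the tokens still carry offsets into the original string. Here is a rough Python sketch of that idea. It is a toy model under my own assumptions: the regexes and the function names `tokens_outside_markup` and `highlight` are hypothetical, and the real html_strip filter does more (entity decoding, for instance).

```python
import re

TAG = re.compile(r"<[^>]*>")   # crude stand-in for html_strip
WORD = re.compile(r"\w+")      # crude stand-in for the standard tokenizer


def tokens_outside_markup(source):
    """Yield (term, start, end) for words outside tags.

    Offsets index the ORIGINAL string, markup included.
    """
    pos = 0
    for tag in TAG.finditer(source):
        # Tokenize only the text between the previous tag and this one.
        for m in WORD.finditer(source, pos, tag.start()):
            yield m.group().lower(), m.start(), m.end()
        pos = tag.end()
    for m in WORD.finditer(source, pos):
        yield m.group().lower(), m.start(), m.end()


def highlight(source, query, pre="<em>", post="</em>"):
    """Insert pre/post tags into the original source at matching offsets."""
    out, last = [], 0
    for term, start, end in tokens_outside_markup(source):
        if term == query.lower():
            out.append(source[last:start])
            out.append(pre + source[start:end] + post)
            last = end
    out.append(source[last:])
    return "".join(out)


print(highlight("<p> p </p>", "p"))
# <p> <em>p</em> </p>
```

The tag name never becomes a token, so it can neither match the query nor be wrapped, yet the surrounding markup is returned untouched.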