Elasticsearch has a built-in "highlight" function which allows you to tag the matched terms in the results (more complicated than it might at first sound, since the query syntax can include near matches etc.).
I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.
Can I make it HTML-aware / HTML-safe when highlighting in this way?
I'd like the highlighting to apply to the text in the HTML document, and not to highlight any HTML markup which has matched the search, i.e. a search for "p" might highlight <p>p</p> -> <p><mark>p</mark></p>.
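To make the behaviour I'm after concrete, here is a rough client-side sketch in Python (standard library only). It is not Elasticsearch's highlighter and it ignores query analysis entirely; the class and function names are hypothetical, and matching is a simple case-insensitive whole-word regex. Tags and attributes are copied through untouched, and only text nodes are searched:

```python
import re
from html.parser import HTMLParser


class TextNodeHighlighter(HTMLParser):
    """Re-emit an HTML fragment, wrapping matches found in text nodes only."""

    def __init__(self, term, pre_tag="<mark>", post_tag="</mark>"):
        # convert_charrefs=False keeps entities such as &amp; as they were.
        super().__init__(convert_charrefs=False)
        # Assumption: a whole-word, case-insensitive match stands in for the query.
        self.pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
        self.pre_tag = pre_tag
        self.post_tag = post_tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the original start tag text verbatim, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim too.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Note: the parser lowercases the tag name here.
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Only text between tags is ever searched for the term.
        self.out.append(
            self.pattern.sub(
                lambda m: self.pre_tag + m.group(0) + self.post_tag, data
            )
        )

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def highlight_text_nodes(html_fragment, term, pre_tag="<mark>", post_tag="</mark>"):
    parser = TextNodeHighlighter(term, pre_tag, post_tag)
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.out)


print(highlight_text_nodes('<p class="text">p</p>', "p"))
# prints: <p class="text"><mark>p</mark></p>
```

The <p> tags are left alone and only the text "p" is wrapped. This sketch drops doctype declarations and would also search inside script/style content, so it only illustrates the output I'd like Elasticsearch itself to produce.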
My fields are indexed as "type: string".
The documentation says:
Encoder:
An encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape html, if you use html highlighting tags).
... but that HTML-escapes my already HTML-encoded field, breaking things further.
Here are two example queries:
- Using the default encoder:
The highlight tags are inserted inside other tags, i.e. <p> -> <<tag1>p</tag1>>:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
...
- Using the html encoder:
The existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> -> &lt;<tag1>p</tag1>&gt;:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class="text">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class="text">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class="text">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
}
...
... <tag1><p></tag1>? – Andrei Stefan
... preview_html field? – Andrei Stefan
type: string is the only detail for that field? You don't have analyzers defined or similar? Also, you say not to highlight any HTML markup which has matched the search but in the example you are highlighting html tags (<p><mark>p</mark></p>), so which one is it? – Andrei Stefan