I'm new to Elasticsearch. I'm trying to fix our search so that users can search on content within HTML tags. Currently we use a whitespace tokenizer because we need it to return results for hyphenated names. Consequently, "aname123-suffix project" is indexed as ["aname123-suffix", "project"], and a user search for "aname123-*" returns the correct results.
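To make the current behaviour concrete, here is a minimal pure-Python sketch of it. It is only an approximation: `whitespace_tokenize` and `wildcard_hits` are hypothetical helpers that stand in for the whitespace tokenizer and a wildcard search, not Elasticsearch calls.

```python
from fnmatch import fnmatchcase


def whitespace_tokenize(text):
    """Split on runs of whitespace only, leaving hyphens and other punctuation intact."""
    return text.split()


def wildcard_hits(tokens, pattern):
    """Return the tokens that match a shell-style wildcard pattern."""
    return [t for t in tokens if fnmatchcase(t, pattern)]


tokens = whitespace_tokenize("aname123-suffix project")
print(tokens)                               # ['aname123-suffix', 'project']
print(wildcard_hits(tokens, "aname123-*"))  # ['aname123-suffix']
```

Because the hyphen survives tokenization, the prefix "aname123-" is still present in the indexed token, which is why the wildcard search works.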
My problem arises because we also want to be able to search on content within HTML tags. For example, for a project called "<aname123>-suffix project", we'd like to be able to enter the search term "<aname123>-*" and get back the correct results.
The index has the correct tokens for a whitespace tokenizer, namely ["<aname123>-suffix", "project"], but when my search string is "\<aname123\>\-suffix" or "\\<aname123\\>\\-suffix", Elasticsearch returns no results.
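On the two escaped forms: if the request body is JSON, every backslash has to be doubled, so the second string is how the first one is written inside a JSON body. A small sketch of that, assuming the search is sent as a `query_string` query (an assumption; the request body shown is hypothetical):

```python
import json

# The search string as the query parser should see it:
# one backslash before each escaped character.
query_text = r"\<aname123\>\-suffix"

# Hypothetical request body; assumes a query_string query is used.
body = {"query": {"query_string": {"query": query_text}}}

# Serializing to JSON doubles every backslash.
encoded = json.dumps(body)
print(encoded)

# Decoding the JSON gives back the single-backslash form.
decoded_text = json.loads(encoded)["query"]["query_string"]["query"]
print(decoded_text == query_text)  # True
```

This only shows that both forms carry the same search string to the server; it does not explain why neither one matches.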
I think the solution lies in either:
- modifying the search string so that Elasticsearch returns "<aname123>-suffix" when I ask for it; or
- indexing the content within the tag separately from the whitespace tokens, i.e. ["<aname123>-suffix", "project", "aname123", "suffix"]
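The token set in the second option can be sketched in plain Python. This is only an illustration of the target output, not an Elasticsearch analyzer; `tokens_with_tag_content` is a hypothetical helper, and splitting on non-alphanumeric characters is my assumption about how the inner parts would be extracted.

```python
import re


def tokens_with_tag_content(text):
    """Whitespace tokens, followed by the alphanumeric parts of any token containing a tag."""
    whitespace_tokens = text.split()
    extra = []
    for token in whitespace_tokens:
        if "<" in token and ">" in token:
            # Split the tagged token on anything that is not a letter or digit,
            # dropping the empty strings produced at the edges.
            extra.extend(p for p in re.split(r"[^0-9A-Za-z]+", token) if p)
    return whitespace_tokens + extra


print(tokens_with_tag_content("<aname123>-suffix project"))
# ['<aname123>-suffix', 'project', 'aname123', 'suffix']
```

Tokens without a tag are left alone, so "aname123-suffix" would still be indexed whole and the existing hyphenated-name searches would keep working.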
So far I've been approaching it by changing the indexing, but I have not yet succeeded. A standard tokenizer returns results for content within tags, but it fails to return results for "aname123-*". Currently my analyzer settings look like this:
{ "analysis":
  { "analyzer":
    { "my_whitespace_analyzer":
      { "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["standard", "lowercase", "stop"]
      },
      "my_tag_analyzer":
      { "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "stop"]
      }
    }
  }
}
I can create a custom char filter that strips out the < and the >, so that my index contains "aname123", but for some reason Elasticsearch still does not return correct results when searching on "<aname123>*". However, when I use a standard analyzer instead, the index contains "aname123" and it returns the expected results for "<aname123>*". What is so special about angle brackets in Elasticsearch?
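For reference, a char filter of the kind described above could be sketched as follows, written as a Python dict that would be serialized into the index settings. It assumes the built-in `pattern_replace` char filter; the names `strip_angle_brackets` and `my_stripped_analyzer` are hypothetical, and `simulate_analysis` is only a rough local imitation of the analysis chain, not Elasticsearch itself.

```python
import json
import re

# Hypothetical names: "strip_angle_brackets" and "my_stripped_analyzer"
# are illustrative only.
settings = {
    "analysis": {
        "char_filter": {
            "strip_angle_brackets": {
                "type": "pattern_replace",  # built-in regex-based char filter
                "pattern": "[<>]",
                "replacement": "",
            }
        },
        "analyzer": {
            "my_stripped_analyzer": {
                "type": "custom",
                "char_filter": ["strip_angle_brackets"],
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        },
    }
}


def simulate_analysis(text):
    """Rough local imitation: strip < and >, split on whitespace, lowercase."""
    pattern = settings["analysis"]["char_filter"]["strip_angle_brackets"]["pattern"]
    return [token.lower() for token in re.sub(pattern, "", text).split()]


print(json.dumps(settings, indent=2))
print(simulate_analysis("<aname123>-suffix project"))  # ['aname123-suffix', 'project']
```

With a whitespace tokenizer behind the char filter, the indexed token in this sketch is "aname123-suffix" rather than "aname123" on its own, so the index side looks as expected and the mismatch appears to be on the search side.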