I'm new to Elasticsearch. I'm trying to fix our search so that users can search on content within HTML tags. Currently, we use a whitespace tokenizer because we need it to return results for hyphenated names. Consequently, aname123-suffix project is indexed as ["aname123-suffix", "project"], and a user search for "aname123-*" returns the correct results.

My problem arises because we also want to be able to search on content within HTML tags. For example, for a project called <aname123>-suffix project, we'd like to be able to enter the search term <aname123>-* and get back the correct results.

The index has the correct tokens for a whitespace tokenizer, namely ["<aname123>-suffix", "project"], but when my search string is "\<aname123\>\-suffix" or "\\<aname123\\>\\-suffix", Elasticsearch returns no results.

I think the solution lies either in

  1. modifying the search string so that Elasticsearch returns <aname123>-suffix when I ask for it; or
  2. being able to index the content within the tag separately from the whitespace tokens, i.e. ["<aname123>-suffix", "project", "aname123", "suffix"]
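
To make option 2 concrete, here is a minimal plain-Python sketch of the token stream being asked for. It only emulates the desired output and is not Elasticsearch's analysis chain; `expand_tokens` and `ALNUM_RUN` are hypothetical names introduced for illustration. Inside Elasticsearch, a token filter that keeps the original token while also emitting its parts (for instance `word_delimiter` with `preserve_original`) might get close, but its exact splitting rules would need to be checked against real data.

```python
import re

# Hypothetical helper pattern: runs of letters/digits inside a token.
ALNUM_RUN = re.compile(r"[a-z0-9]+")


def expand_tokens(text: str) -> list[str]:
    """Return whitespace tokens, followed by the alphanumeric parts of any
    token that contains other characters (angle brackets, hyphens, ...)."""
    whitespace_tokens = text.lower().split()
    extra_tokens = []
    for token in whitespace_tokens:
        parts = ALNUM_RUN.findall(token)
        # Skip tokens that are already a single alphanumeric run.
        if parts != [token]:
            extra_tokens.extend(parts)
    return whitespace_tokens + extra_tokens


print(expand_tokens("<aname123>-suffix project"))
# → ['<aname123>-suffix', 'project', 'aname123', 'suffix']
```

With a token list like this, both the literal form and the bare name inside the tag would be present in the index.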

So far I've been approaching it by changing the indexing, but I have not yet succeeded. A standard tokenizer will allow search results for content within tags, but it fails to return search results for aname123-*. Currently my analyzer settings look like this:

{
  "analysis": {
    "analyzer": {
      "my_whitespace_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["standard", "lowercase", "stop"]
      },
      "my_tag_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "stop"]
      }
    }
  }
}

I can create a custom char filter that strips out the < and the >, so that my index contains aname123, but for some reason Elasticsearch still does not return correct results when searching on <aname123>*. However, when I use a standard analyzer instead, the index contains aname123 and the search for <aname123>* returns the expected results. What is so special about angle brackets in Elasticsearch?
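
On the angle-bracket question, one possible explanation, assuming the search is sent as a `query_string` query (the post does not say): the `query_string` documentation lists `<` and `>` among its reserved characters and notes that they cannot be escaped at all, so the only way around them is to remove them from the query text. Wildcard terms may also bypass the analyzer, in which case an index-time char filter would not be applied to `<aname123>*` at search time. The sketch below covers option 1 on the client side; `sanitize_query` is a hypothetical helper, and it deliberately handles only the single-character reserved symbols.

```python
import re

# Single-character reserved symbols from the query_string docs, excluding
# the wildcards * and ? (kept so prefix searches still work) and excluding
# < and >, which are removed rather than escaped.
ESCAPABLE = re.compile(r'([+\-=!(){}\[\]^"~:\\/])')


def sanitize_query(user_input: str) -> str:
    """Drop angle brackets, then backslash-escape other reserved characters."""
    without_angles = user_input.replace("<", "").replace(">", "")
    return ESCAPABLE.sub(r"\\\1", without_angles)


print(sanitize_query("<aname123>-*"))
# → aname123\-*
```

This only helps if the index also holds the bracket-free form (for example via the char filter described above), since the sanitized query no longer contains the brackets.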

1 Answer

You may want to take a look at the html_strip character filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
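
As a rough sketch of how that char filter could be wired into a custom analyzer, the snippet below builds an index-settings body as a Python dict and serialises it to JSON. The analyzer name `my_html_analyzer` is made up for illustration, and pairing the filter with the whitespace tokenizer is an assumption based on the question. `html_strip`, `whitespace`, and `lowercase` are built-in names. Note that `html_strip` removes what it treats as markup, so it may drop `<aname123>` entirely rather than keep the name inside it; running the text through the `_analyze` API would confirm the behaviour.

```python
import json

# Hypothetical analyzer name; the char filter, tokenizer and token filter
# referenced here are Elasticsearch built-ins.
settings = {
    "analysis": {
        "analyzer": {
            "my_html_analyzer": {
                "type": "custom",
                "char_filter": ["html_strip"],  # runs before tokenization
                "tokenizer": "whitespace",      # keeps hyphenated names whole
                "filter": ["lowercase"],
            }
        }
    }
}

# JSON body that could be sent when creating the index.
body = json.dumps({"settings": settings}, indent=2)
print(body)
```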

An example from one of the Elasticsearch developers is here:

https://gist.github.com/clintongormley/780895