1 vote

I would like to index some words with special characters all together.

For example, given m&m, I would like to index it as a whole rather than splitting it into m and m (normally, & would be treated as a delimiter).

Is there a way to achieve this with a standard tokenizer/filter, or do I have to write one myself?


3 Answers

3 votes

Basically, the text field type filters out special characters before indexing. You could use the string type instead, but searching on it is not advisable. You can use the types option of WordDelimiterFilterFactory to change how those special characters are classified, or rewrite them as words via a character mapping, e.g.:

% => percent
& => and
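
A minimal sketch of the types approach, assuming a field type named text_special and a types file named wdfftypes.txt (both placeholder names). The types file reclassifies % and & as ALPHA so WordDelimiterFilterFactory does not split on them:

      <fieldType name="text_special" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <!-- split on whitespace only; the filter below decides what else is a delimiter -->
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- wdfftypes.txt reclassifies selected characters; contents shown below -->
          <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>

with wdfftypes.txt containing:

      % => ALPHA
      & => ALPHA

With this in place, m&m survives as a single token. The word rewriting shown above (% => percent, & => and) would instead be configured with a solr.MappingCharFilterFactory and a mapping file.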

3 votes

The StandardTokenizer splits/tokenizes the given text at special characters. To index terms with special characters, you can either write your own custom tokenizer or do the following:

  • Take a list of characters at which you want to tokenize/split the text. For example, my list is {" ", ";"}.
  • Use a PatternTokenizer with the above list of characters instead of the StandardTokenizer. Your configuration will look like:

      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern=" |;" />
      </analyzer>
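
With this tokenizer, the text is split only at the characters in the pattern (a Java regular expression), so & survives inside tokens. A fuller sketch, with text_pattern as a placeholder field type name:

      <fieldType name="text_pattern" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <!-- split at spaces and semicolons only -->
          <tokenizer class="solr.PatternTokenizerFactory" pattern=" |;"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>

For example, the input m&m;chocolate bar is tokenized as m&m, chocolate, and bar.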
    
1 vote

You can use the WhitespaceTokenizerFactory.

http://docs.lucidworks.com/display/solr/Tokenizers#Tokenizers-WhiteSpaceTokenizer

It tokenizes only on whitespace. For example, "m&m" is kept as a single token and indexed as such.