
I am trying to understand how Lucene's analyzers work, and in particular how Lucene handles synonyms. There are two cases: single-word synonyms and multi-word synonyms.

Single word: foo = bar. Multi-word: foo bar = foobar.

For single words:

  • Does Lucene expand the indexed documents? My guess is that if a query contains "foo", Lucene adds "bar" to the query as well, but I don't know whether the same expansion happens at index time.

For multi words:

  • Does Lucene expand both the query and the index? For example, given "foo bar", does it add "foobar" at index time, at query time, or both?

My second question: Lucene passes a stream of tokens through filters such as the lowercase filter. How does it detect multi-word synonyms in that stream? That is, how does it recognize that the adjacent tokens "foo" and "bar" together form the entry "foo bar"?

thanks


1 Answer


SynonymFilter can optionally keep the original word and add the synonym to the TokenStream alongside it, by setting keepOrig=true (see SynonymMap.Builder.add()). This behavior can cause problems for PhraseQueries and other position-sensitive queries; see the first note in the SynonymFilter docs.

If you use the same Analyzer for querying and indexing, then queries and documents written to the index are, of course, treated the same way. SynonymFilter with keepOrig set to true is one of the few filters that is reasonably often applied on only one side (at index time or at query time, but not both), but that is entirely up to your implementation.

As for how it is implemented, the source code is available to you.
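To give a rough feel for the multi-word matching without reading the source, here is a toy sketch in plain Java. It is not Lucene's code and uses no Lucene classes; the class `SynonymSketch`, its methods, and the `term@position` output format are all made up for illustration. It assumes rules that map a sequence of input tokens to a single output token, looks ahead in the token list for the longest rule that starts at the current token, and either replaces the match or, with `keepOrig`, stacks the synonym at the position of the first matched token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Lucene's SynonymFilter, just the longest-match idea.
class SynonymSketch {

    // Hypothetical rule set taken from the question: foo = bar, "foo bar" = foobar.
    static Map<List<String>, String> demoRules() {
        Map<List<String>, String> rules = new HashMap<>();
        rules.put(Arrays.asList("foo"), "bar");
        rules.put(Arrays.asList("foo", "bar"), "foobar");
        return rules;
    }

    // Returns tokens as "term@position"; two tokens with the same position are stacked.
    static List<String> expand(List<String> tokens,
                               Map<List<String>, String> rules,
                               boolean keepOrig) {
        // The longest rule input decides how far we must look ahead.
        int maxLen = 1;
        for (List<String> input : rules.keySet()) {
            maxLen = Math.max(maxLen, input.size());
        }

        List<String> out = new ArrayList<>();
        int pos = 0; // position of the next emitted token
        int i = 0;   // index of the current input token
        while (i < tokens.size()) {
            // Try the longest candidate first, then shorter ones.
            int matched = 0;
            String synonym = null;
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                String candidate = rules.get(tokens.subList(i, i + len));
                if (candidate != null) {
                    matched = len;
                    synonym = candidate;
                    break;
                }
            }

            if (matched == 0) {
                // No rule starts here: pass the token through unchanged.
                out.add(tokens.get(i) + "@" + pos);
                pos++;
                i++;
            } else if (keepOrig) {
                // Keep the originals and stack the synonym on the first matched token.
                out.add(tokens.get(i) + "@" + pos);
                out.add(synonym + "@" + pos);
                for (int k = 1; k < matched; k++) {
                    out.add(tokens.get(i + k) + "@" + (pos + k));
                }
                pos += matched;
                i += matched;
            } else {
                // Replace the whole matched span with the synonym.
                out.add(synonym + "@" + pos);
                pos++;
                i += matched;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "foo", "bar", "b");
        System.out.println(expand(tokens, demoRules(), true));
        // prints [a@0, foo@1, foobar@1, bar@2, b@3]
        System.out.println(expand(tokens, demoRules(), false));
        // prints [a@0, foobar@1, b@2]
    }
}
```

In this sketch the keepOrig output shows why phrase queries can misbehave: "foobar" shares a position with "foo" but spans two original tokens, so the positions of the synonym and of the original words no longer line up one to one.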