I recently noticed that the behavior of the Lucene StandardAnalyzer changed in version 3.1. Concretely, 3.0 and earlier versions recognized e-mail addresses, IP addresses, company names, etc. as separate lexical types, while later versions don't.

For example, for the input text "example@mail.com 127.0.0.1 H&M", the 3.0 analyzer would recognize the following types:

1: example@mail.com: 0->16: <EMAIL>

2: 127.0.0.1: 17->26: <HOST>

3: h&m: 27->30: <COMPANY>

However, versions 3.1 and later give the following output for the same input text:

1: example: 0->7: <ALPHANUM>

2: mail.com: 8->16: <ALPHANUM>

3: 127.0.0.1: 17->26: <NUM>

My question is: how can I get the old StandardAnalyzer behavior with newer versions of the Lucene library? Are there standard TokenFilters that can help me achieve this, or do I need to implement custom filters?

1 Answer

See the javadocs for StandardAnalyzer: As of 3.1, StandardTokenizer implements Unicode text segmentation.... ClassicTokenizer and ClassicAnalyzer are the pre-3.1 implementations of StandardTokenizer and StandardAnalyzer.

Alternatively, you can pass the LUCENE_30 version to StandardAnalyzer to get the previous behavior. That's the purpose of these version constants: behavior stays consistent for existing users, and you decide when to upgrade your app to the changed behavior.
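To compare the two options side by side, here is a minimal sketch that prints each token with its offsets and type. It assumes a Lucene 3.x release (3.1 to 3.6) on the classpath, where ClassicAnalyzer and StandardAnalyzer both take a Version argument; later major versions changed the constructors and package locations, so adjust accordingly. The class name `ClassicTokenTypes`, the helper `describeTokens`, and the field name "content" are made up for illustration, and the input string assumes the address in the question was something like example@mail.com, which is consistent with the offsets shown there.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class, written against the Lucene 3.x API (assumption).
public class ClassicTokenTypes {

    // Runs the analyzer over the text and returns one
    // "term: start->end: type" line per emitted token.
    static List<String> describeTokens(Analyzer analyzer, String text) {
        List<String> lines = new ArrayList<String>();
        try {
            // "content" is an arbitrary field name; these analyzers ignore it.
            TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                lines.add(term.toString() + ": " + offset.startOffset()
                        + "->" + offset.endOffset() + ": " + type.type());
            }
            ts.end();
            ts.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input, modeled on the example in the question.
        String text = "example@mail.com 127.0.0.1 H&M";

        // Option 1: the pre-3.1 implementation under its new name.
        Analyzer classic = new ClassicAnalyzer(Version.LUCENE_31);
        // Option 2: StandardAnalyzer pinned to the 3.0 behavior.
        Analyzer pinned = new StandardAnalyzer(Version.LUCENE_30);

        System.out.println("ClassicAnalyzer:");
        for (String line : describeTokens(classic, text)) {
            System.out.println("  " + line);
        }
        System.out.println("StandardAnalyzer(LUCENE_30):");
        for (String line : describeTokens(pinned, text)) {
            System.out.println("  " + line);
        }
    }
}
```

Both analyzers should report the <EMAIL>, <HOST> and <COMPANY> types listed in the question. If you index and search with the same analyzer, either option works; ClassicAnalyzer has the advantage of not depending on a version constant that later releases may drop.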