I recently noticed that the behavior of the Lucene StandardAnalyzer changed in version 3.1. Concretely, 3.0 and earlier versions recognized e-mail addresses, IP addresses, company names, etc. as separate lexical types, while later versions don't.
For example, for the input text "example@mail.com 127.0.0.1 H&M", the 3.0 analyzer would recognize the following types:
1: example@mail.com: 0->16: <EMAIL>
2: 127.0.0.1: 17->26: <HOST>
3: h&m: 27->30: <COMPANY>
However, version 3.1 and later give the following output for the same input text:
1: example: 0->7: <ALPHANUM>
2: mail.com: 8->16: <ALPHANUM>
3: 127.0.0.1: 17->26: <NUM>
My question is: how can I get the old StandardAnalyzer behavior with a newer version of the Lucene library? Are there standard TokenFilters that can help me achieve this, or do I need to implement custom filters?
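
To make clear what I mean by a custom step, here is a minimal plain-Java sketch that re-derives the old types from whitespace-separated chunks. It does not use Lucene at all. The class name `ClassicTypeSketch`, its methods, and the three regular expressions are my own assumptions for illustration and are much cruder than the real 3.0 grammar. The sample address `example@mail.com` is also an assumption, chosen to match the offsets shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassicTypeSketch {
    // Hypothetical, simplified patterns; NOT the grammar Lucene 3.0 used.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");
    private static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");
    private static final Pattern COMPANY =
            Pattern.compile("[A-Za-z0-9]+(&[A-Za-z0-9]+)+");
    // A chunk is any maximal run of non-whitespace characters.
    private static final Pattern CHUNK = Pattern.compile("\\S+");

    // Returns the type label for one chunk, checking the most specific pattern first.
    static String typeOf(String chunk) {
        if (EMAIL.matcher(chunk).matches()) return "<EMAIL>";
        if (HOST.matcher(chunk).matches()) return "<HOST>";
        if (COMPANY.matcher(chunk).matches()) return "<COMPANY>";
        return "<ALPHANUM>";
    }

    // Produces lines in the "position: term: start->end: <TYPE>" format used above.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        int position = 0;
        while (m.find()) {
            position++;
            // Lowercase the term, as the listing above shows "h&m" for "H&M".
            String term = m.group().toLowerCase(Locale.ROOT);
            out.add(position + ": " + term + ": " + m.start() + "->" + m.end()
                    + ": " + typeOf(term));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("example@mail.com 127.0.0.1 H&M")) {
            System.out.println(line);
        }
    }
}
```

For the sample input this prints the same three lines as the 3.0 listing above, but it is only a stand-in for the classification step. I would much rather use something that ships with Lucene than maintain patterns like these myself.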