0
votes

I am working on a project that involves indexing files with Apache Lucene. I can index the files successfully, but when I see the result, I get many abrupt words, probably because I am not removing stop words while indexing.

I read on the web that Lucene provides a way to remove stop words while indexing files. How can I do that?

2
My answer describes how stop words work, and hopefully that helps, but based on your description of the problem, I'm not entirely confident that stop words are your problem. I don't know what results you are referring to when you say you "see the result", nor do I know what "abrupt words" are. If stop words don't turn out to be the problem, a more detailed description of the problem you're seeing, preferably with examples, might help solve it. - femtoRgon

2 Answers

1
votes

Lucene's StandardAnalyzer includes a StopFilter that removes some typical stop words from anything passed through it. The standard list of English stop words is pretty short: mainly some articles, pronouns, and prepositions.
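To make the idea concrete, here is a minimal plain-Java sketch of what a stop filter does conceptually: tokenize, lowercase, then drop anything found in a stop set. It is not Lucene's implementation; the class name `StopFilterSketch` and the tiny stop list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopFilterSketch {
    // Hypothetical, deliberately tiny stop list; Lucene's own list differs.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "on", "is", "of");

    // Split on anything that is not a letter or digit, lowercase each token,
    // then keep only the tokens that are not in the stop set.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The cat is on the mat")); // prints [cat, mat]
    }
}
```

Note that lowercasing happens before the stop-set lookup, which is why "The" is removed even though the set only contains "the".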

If you wish to define your own set of stop words, StandardAnalyzer has a couple of constructors that let you pass them in, in particular the one that accepts a CharArraySet. Simply create a CharArraySet containing the desired stop words, pass it into that constructor, and you're on your way.
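A rough sketch of that approach, assuming Lucene 7 or later on the classpath, where `CharArraySet` lives in `org.apache.lucene.analysis` (older releases keep it in `org.apache.lucene.analysis.util`, and the 4.x constructors also take a `Version` argument). The class name, the field name `"body"`, and the stop list are placeholders. As far as I know, newer Lucene releases no longer apply an English stop list in the no-argument `StandardAnalyzer` constructor, so passing the set explicitly is the safer route either way.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class CustomStopWordsExample {
    // Run text through the analyzer and collect the tokens it emits.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        // "body" is a placeholder field name.
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative stop list; 'true' makes lookups case-insensitive.
        CharArraySet stopWords =
                new CharArraySet(Arrays.asList("the", "a", "of", "is"), true);
        try (Analyzer analyzer = new StandardAnalyzer(stopWords)) {
            System.out.println(tokens(analyzer, "The index is a collection of documents"));
        }
    }
}
```

The same analyzer instance can then be handed to `IndexWriterConfig` when indexing and to the query parser when searching.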

I believe most other typical analyzers have a constructor accepting the same arguments as well (at a glance, it looks like almost all of the language analyzers in analyzers-common follow that pattern).

Of course, be sure to use the same analyzer for both indexing and searching.
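To show why that matters, here is a toy plain-Java sketch (no Lucene involved; every name in it is made up). Two stand-in "analyzers" differ only in whether they lowercase, and a query analyzed differently from the index simply misses terms that are in there.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class AnalyzerMismatchSketch {
    // Toy "analyzers": both split on whitespace, only the first lowercases.
    static final Function<String, List<String>> LOWERCASING =
            text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    static final Function<String, List<String>> WHITESPACE_ONLY =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Toy inverted index: term -> ids of the documents containing it.
    static Map<String, Set<Integer>> index(List<String> docs,
                                           Function<String, List<String>> analyzer) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, t -> new HashSet<>()).add(id);
            }
        }
        return postings;
    }

    // A document is a hit if it contains any of the analyzed query terms.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query,
                               Function<String, List<String>> analyzer) {
        Set<Integer> hits = new HashSet<>();
        for (String term : analyzer.apply(query)) {
            hits.addAll(postings.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings =
                index(List.of("Apache Lucene indexing", "Stop words in search"), LOWERCASING);
        System.out.println(search(postings, "Lucene", LOWERCASING));     // prints [0]
        System.out.println(search(postings, "Lucene", WHITESPACE_ONLY)); // prints []
    }
}
```

The index stores "lucene", so the second query, which keeps the capital "L", finds nothing. Stop-word handling behaves the same way: the terms produced at search time have to line up with the terms that were written at index time.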

0
votes

If you use StandardAnalyzer or StopAnalyzer, stop words like "on, a, an, the" are automatically removed during indexing, and you cannot search for them. If you also want to be able to search for stop words like "was, is, on", you have to use WhitespaceAnalyzer or SimpleAnalyzer.