Removing stop words while indexing files using Apache Lucene

Question

I am working on a project which involves indexing files using Apache Lucene. I can index the files successfully, but when I see the result, I get many abrupt words, probably because I am not removing stop words while indexing.

I read on the web that Lucene provides a way to remove stop words while indexing files. How can I do that?

My answer describes how stop words work, and hopefully that helps, but based on your description of the problem, I'm not entirely confident that stop words are your problem. I don't know what results you are referring to when you say you "see the result", nor do I know what "abrupt words" are. If stop words don't turn out to be the problem, a more detailed description of the problem you're seeing, preferably with examples, might help solve it. - femtoRgon

femtoRgon · Accepted Answer · 2013-02-28T06:20:50

Lucene's StandardAnalyzer includes a StopFilter that removes some typical stop words from anything passed through it. The standard list of English stop words is pretty short: mainly some articles, pronouns and prepositions.
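To make the filtering step concrete, here is a plain-Java sketch of the idea behind a stop filter. It is not Lucene's implementation, and the class name, method name and stop list are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual sketch only: this is NOT Lucene's StopFilter, just the same idea in plain Java.
class StopWordSketch {

    // Hypothetical stop list for illustration; Lucene's built-in English list differs.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "to", "is");

    // Lowercase the text, split on runs of non-letter characters,
    // and keep only the tokens that are not in the stop list.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [name, index, lucene]
        System.out.println(removeStopWords("The name of the index is Lucene"));
    }
}
```

Only the terms that survive this step would end up in the index, which is why a short stop list still leaves plenty of common words behind.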

If you wish to define your own set of stop words, StandardAnalyzer has a couple of constructors that let you pass one in, in particular the one that accepts a CharArraySet. Simply create a CharArraySet containing the desired stop words, pass it into that constructor, and you're on your way.
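A minimal sketch of that, assuming Lucene 7 or later, where CharArraySet lives in org.apache.lucene.analysis and the constructors no longer take a Version argument. On the 4.x releases current when this answer was written, the import path differs and both constructors take a Version as their first parameter. The class name, the field name "content" and the stop words themselves are made up for illustration. As far as I know, from Lucene 8 onward the no-argument StandardAnalyzer constructor no longer removes stop words by default, so passing the set explicitly matters even more there.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet; // assumption: Lucene 7+ package location
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class CustomStopWordsExample {

    // Build a StandardAnalyzer with a custom, case-insensitive stop set (hypothetical words).
    static Analyzer newAnalyzer() {
        CharArraySet stopWords =
                new CharArraySet(Arrays.asList("the", "of", "is", "lucene"), true);
        return new StandardAnalyzer(stopWords);
    }

    // Run text through the analyzer and collect the terms that survive.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // "content" is a hypothetical field name; StandardAnalyzer treats all fields alike.
        try (TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = newAnalyzer()) {
            // Print the terms left after tokenizing, lowercasing and stop word removal.
            System.out.println(analyze(analyzer, "The name of the index is Lucene"));
        }
    }
}
```

Dumping the token stream like this is also a quick way to check what actually ends up in the index before blaming stop words for odd-looking results.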

I believe most other typical analyzers have a constructor accepting the same arguments as well (at a glance, it looks like almost all of the language analyzers in analyzers-common follow that pattern).

Of course, be sure to use the same analyzer for both indexing and searching.
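A rough sketch of sharing one analyzer between the two sides, assuming Lucene 8 or later with the lucene-queryparser module on the classpath. The class name, the field name "content", the sample text and the stop words are all hypothetical.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet; // assumption: Lucene 7+ package location
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

class SharedAnalyzerExample {

    // Index one document and run one query, using the SAME analyzer instance for both.
    static int countHits(String docText, String queryText) throws Exception {
        CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "of"), true);
        try (Analyzer analyzer = new StandardAnalyzer(stopWords);
             Directory dir = new ByteBuffersDirectory()) { // in-memory index, for the demo only

            // Indexing side: the analyzer is handed to the IndexWriterConfig.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("content", docText, Field.Store.NO));
                writer.addDocument(doc);
            }

            // Search side: the same analyzer is handed to the QueryParser,
            // so query text is tokenized and stop-filtered exactly like the documents.
            Query query = new QueryParser("content", analyzer).parse(queryText);
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return new IndexSearcher(reader).search(query, 10).scoreDocs.length;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countHits("The name of the index", "the index"));
    }
}
```

If the query side used a different analyzer, query terms could be tokenized or filtered differently from the indexed terms and silently fail to match.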
