I'm using Lucene 4.4 to analyze a small corpus. I've tried StopAnalyzer, but many terms that I don't need still show up in my results, for example "I'll", "we", and "x". So I need to customize the stopword list provided by Lucene. My questions are:
How do I add new stopwords? I know that Lucene has this constructor, which accepts a custom stopword set:
public StopAnalyzer(Version matchVersion, CharArraySet stopWords)
But I don't want to build the stopword list from scratch. I want to keep the existing stopwords and just add the extra ones that I need.
How can I filter out all numbers, both digits and spelled-out number words, such as "1", "20", "five", "ten", etc.?
My solution
- As femtoRgon showed, the stopword list provided by Lucene is very small and cannot be changed. I created a CustomizeStopAnalyzer that takes a list of stopwords; it uses StandardTokenizer and chains a few filters together.
- To remove numbers, I added a NumericFilter class that checks every token to see if it is numeric.
Many thanks,
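A minimal plain-Java sketch of the two rules described above. It is not the actual CustomizeStopAnalyzer or NumericFilter code. The class name TokenRules, its methods, and the short NUMBER_WORDS list are all hypothetical, and it deliberately avoids Lucene classes so it runs without Lucene on the classpath. In a real analyzer the same checks would presumably sit inside the stop filter and a custom TokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: plain-Java token rules that a
// custom analyzer's filters could delegate to.
final class TokenRules {

    // Illustrative and deliberately incomplete list of spelled-out numbers (assumption).
    private static final Set<String> NUMBER_WORDS = new HashSet<>(Arrays.asList(
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "twenty", "hundred", "thousand"));

    private TokenRules() {}

    // Returns a NEW lowercased set holding the defaults plus the extras.
    // The inputs are never modified, which matters when the default set is read-only.
    static Set<String> extend(Set<String> defaults, Set<String> extras) {
        Set<String> merged = new HashSet<>();
        for (String w : defaults) merged.add(w.toLowerCase(Locale.ROOT));
        for (String w : extras) merged.add(w.toLowerCase(Locale.ROOT));
        return merged;
    }

    // True for digit-only tokens, simple grouped or decimal forms such as
    // "1,000" or "3.14", and any word in NUMBER_WORDS (case-insensitive).
    static boolean isNumeric(String token) {
        if (token == null || token.isEmpty()) return false;
        if (token.matches("\\d+([.,]\\d+)*")) return true;
        return NUMBER_WORDS.contains(token.toLowerCase(Locale.ROOT));
    }

    // Keeps only the tokens that are neither stopwords nor numeric.
    static List<String> filter(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            boolean isStopword = stopwords.contains(t.toLowerCase(Locale.ROOT));
            if (!isStopword && !isNumeric(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for a built-in default list; the real one would come from Lucene.
        Set<String> defaults = new HashSet<>(Arrays.asList("the", "a", "and"));
        Set<String> extras = new HashSet<>(Arrays.asList("I'll", "we", "x"));
        Set<String> stopwords = extend(defaults, extras);
        System.out.println(filter(Arrays.asList("We", "counted", "20", "cats", "and", "five", "dogs"), stopwords));
    }
}
```

Copying into a new set rather than mutating the defaults matches the constraint noted above that the built-in list cannot be changed. Spelled-out numbers can only be caught with an explicit word list, so NUMBER_WORDS would need to be extended to cover the corpus.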