I'm using Lucene 4.4 to analyze a small corpus. I've tried StopAnalyzer, but many terms that I don't need still show up in my results, for example "I'll", "we", and "x". So I need to customize the stopword list provided by Lucene. My questions are:

  1. How do I add new stopwords? I know that Lucene has this constructor, which accepts a custom stopword set:

    public StopAnalyzer(Version matchVersion, CharArraySet stopWords)

    But I don't want to build the stopword list from scratch. I want to use the existing stopwords and just add the extra ones that I need.

  2. How can I filter out all numbers, both literal numbers and number words, such as "1", "20", "five", "ten", etc.?

My solution

  1. As femtoRgon showed, the stopword list provided by Lucene is very small and cannot be changed. I created a CustomizeStopAnalyzer that takes a list of stopwords. I use StandardTokenizer and chain a few filters together.
  2. To remove numbers, I added a NumericFilter class that checks every token to see whether it is numeric.

Many thanks.

1 Answer

1 - The standard stop word set is StopAnalyzer.ENGLISH_STOP_WORDS_SET. It is unmodifiable, so you should just copy its contents as a starting point for your own set:

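    import java.util.Arrays;
    import java.util.List;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    // The default English stop words, copied so that the list can be extended.
    final List<String> stopWords = Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    );
    // Version.LUCENE_44 matches the Lucene 4.4 release used in the question;
    // the last argument (ignoreCase) is false, so matching is case-sensitive.
    final CharArraySet stopSet = new CharArraySet(Version.LUCENE_44,
        stopWords, false);
    stopSet.add("we"); // this copy is mutable, so extra terms can be added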
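
The copy-then-extend idea can also be sketched without retyping the list. The example below is plain Java with no Lucene dependency: `ExtendedStopWords`, `DEFAULTS`, and `extend` are hypothetical names, and the abbreviated default set only stands in for Lucene's read-only one. With Lucene, the same shape should be achievable by passing the default set itself as the collection argument of the `CharArraySet` constructor, but that is an assumption not exercised here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ExtendedStopWords {
    // Stand-in for a read-only default stop set (hypothetical, abbreviated).
    static final Set<String> DEFAULTS = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "with")));

    // Copy the defaults into a fresh mutable set, then add the extra terms.
    // The read-only source set is never modified.
    static Set<String> extend(Set<String> defaults, String... extras) {
        Set<String> result = new HashSet<>(defaults);
        result.addAll(Arrays.asList(extras));
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopSet = extend(DEFAULTS, "we", "x");
        System.out.println(stopSet.contains("the")); // a default word
        System.out.println(stopSet.contains("we"));  // an added word
    }
}
```

Calling `add` directly on the unmodifiable set would throw `UnsupportedOperationException`, which is why the copy comes first.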

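2 - A stop filter isn't the right approach for literal numbers. I suspect you are looking for something like LetterTokenizer, which defines tokens as maximal runs of consecutive letters and so drops every non-letter character, digits included. Number words such as "five" or "ten" are ordinary letter tokens, though, so they would still need to go into the stop set.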
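
To make the effect of letter-based tokenization concrete, here is a rough plain-Java approximation. It is not Lucene's implementation: `LetterRunDemo`, `STOP`, and `tokens` are hypothetical names, and the tiny stop set is only for illustration. It splits text into runs of letters, lowercases each run, and drops anything in the stop set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LetterRunDemo {
    // Hypothetical stop set: a few ordinary stop words plus number words.
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "we", "the", "five", "ten"));

    // Split text into maximal runs of letters, lowercase each run,
    // and keep only the runs that are not in the stop set.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        // Iterate one step past the end so the final run is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length()
                && Character.isLetter(text.charAt(i));
            if (letter) {
                run.append(text.charAt(i));
            } else if (run.length() > 0) {
                String token = run.toString().toLowerCase(Locale.ROOT);
                if (!STOP.contains(token)) {
                    out.add(token);
                }
                run.setLength(0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [bought, apples, and, pears, in]
        System.out.println(tokens("We bought 20 apples and five pears in 2013"));
    }
}
```

One side effect worth knowing: because the apostrophe is not a letter, "I'll" comes out as the two tokens "i" and "ll", so both fragments would have to be in the stop set to suppress it.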