
The Standard Analyzer removes special characters, but not all of them (e.g. '-'). I want to index my string with only alphanumeric characters, but with the tokens still referring to the original document.

Example: 'doc-size type' should be indexed as 'docsize' and 'type', and both tokens should point to the original document: 'doc-size type'.


1 Answer


It depends on what you mean by "special characters", and on what other requirements you may have. But the following may give you what you need, or at least point you in the right direction.

The following examples all assume Lucene version 8.4.1.

Basic Example

Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import java.util.regex.Pattern;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split the input on whitespace only:
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;
        // Then strip every standard hyphen from each token:
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = true;
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}

This splits on whitespace and then removes hyphens using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespace may be part of the inputs/outputs):

Input text:
「doc-size type」

Output tokens:
「docsize」
「type」

NOTE: this will remove all standard keyboard hyphens, but not characters such as em dashes, en dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (at the start of a word, at the end of a word, on their own, etc.).
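
If it is useful, here is a minimal sketch of how those tokens can be printed, so you can experiment with this analyzer and with the variations further down. The TokenPrinter class, the printTokens method, and the "myField" field name are all my own invention, not part of Lucene:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenPrinter {

    // Prints each token the given analyzer produces for the given text,
    // wrapped in 「 and 」 delimiters.
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        // "myField" is an arbitrary field name - the analyzer above ignores it.
        try (TokenStream stream = analyzer.tokenStream("myField", text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println("「" + termAtt.toString() + "」");
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        printTokens(new MyAnalyzer(), "doc-size type");
    }
}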

A Set of Punctuation Marks

You can change the pattern to cover more punctuation, as needed - for example:

Pattern p = Pattern.compile("[$^-]");

This does the following:

Input text:
「doc-size type $foo^bar」

Output tokens:
「docsize」
「type」
「foobar」

Everything Which Is Not a Letter or Digit

You can use the following to remove everything which is not an ASCII letter or digit:

Pattern p = Pattern.compile("[^A-Za-z0-9]");

This does the following:

Input text:
「doc-size 123 %^&*{} type $foo^bar」

Output tokens:
「docsize」
「123」
「」
「type」
「foobar」

Note that this results in one empty string among the output tokens, because every character in %^&*{} was removed.
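
If that empty token is a problem, one option (my suggestion, not something the question asks for) is to append a LengthFilter to the chain, so that any token shorter than one character is dropped:

import org.apache.lucene.analysis.miscellaneous.LengthFilter;

// ... then, inside createComponents(), after the PatternReplaceFilter:
// Keep only tokens with at least one character, discarding tokens
// that became empty after the pattern replacement.
tokenStream = new LengthFilter(tokenStream, 1, Integer.MAX_VALUE);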

WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - that is, the usual things which typically need to be considered when indexing text.
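
For example, to make the indexed tokens case-insensitive (assuming that is in fact what you need), a LowerCaseFilter can be appended to the same chain:

import org.apache.lucene.analysis.LowerCaseFilter;

// ... inside createComponents(), after the other filters:
// Normalize every token to lowercase, so that "Doc-Size" and "doc-size"
// produce the same indexed token.
tokenStream = new LowerCaseFilter(tokenStream);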

Note on the Standard Analyzer

The StandardAnalyzer actually does remove hyphens inside words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer, and the standard tokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm (UAX #29). That specification has a section discussing how hyphens in words are handled.

So, the Standard analyzer will do this:

Input text:
「doc-size type」

Output tokens:
「doc」
「size」
「type」
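
You can verify this with the token-printing sketch from earlier (again, TokenPrinter is my own helper, not a Lucene class):

import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Prints 「doc」, 「size」 and 「type」, each on its own line:
TokenPrinter.printTokens(new StandardAnalyzer(), "doc-size type");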

That should work with searches for doc as well as doc-size (assuming the query is analyzed the same way) - it's just a question of whether it works well enough for your needs.

I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.