I have indexed a property on OrientDB using Lucene's keyword analyzer:

CREATE INDEX Snippet.ssdeep ON Snippet (ssdeep) FULLTEXT ENGINE LUCENE METADATA {"analyzer":"org.apache.lucene.analysis.core.KeywordAnalyzer"}

The field contains simhashes that I have indexed for testing.

Now when I search using Lucene, I get a response for the exact queries, but not for the fuzzy queries (despite properly escaping the query text).

For instance, given a field with the value "192:d4e1GDZYDUZrw9AfCB+A66ancCZmx9n2P:2e1GW18A66ac/YP", the following query yields one record:

SELECT FROM Snippet WHERE ssdeep LUCENE "192\\:d4e1GDZYDUZrw9AfCB\\+A66ancCZmx9n2P\\:2e1GW18A66ac\\/YP"

While this query yields no records:

SELECT FROM Snippet WHERE ssdeep LUCENE "192\\:d4e1GDZYDUZrw9AfCB\\+A66ancCZmx9n2P\\:2e1GW18A66ac\\/YP~0.9"

What is preventing Lucene from finding approximate matches? More specifically, is Lucene (or the KeywordAnalyzer) unsuited to fuzzy searching on such strings, or is the interface between Lucene and OrientDB at fault?

For context, I have other full-text Lucene indexes on the same database that work, but all of those fields contain ordinary text and are analyzed with the Simple or Standard analyzers. This is the only field I really need a full-text index on, and it is the one that fails.

1 Answer

The problem is letter case. StandardAnalyzer, SimpleAnalyzer, and EnglishAnalyzer all lowercase text before indexing the terms. KeywordAnalyzer doesn't.

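Wildcard, fuzzy, and other expanded multi-term queries aren't run through the analyzer, so by default the QueryParser lowercases them itself. Your exact query still matches because it is analyzed by KeywordAnalyzer, which preserves case. The fuzzy query is lowercased instead, and every uppercase letter in the indexed term then counts as a difference, which pushes the term outside the fuzzy threshold.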
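To make the size of the mismatch concrete, here is a minimal sketch that computes the plain Levenshtein distance between the indexed value from the question and its lowercased form. This is a simplified model, not Lucene's actual fuzzy matching (which is automaton-based), and the class and method names are made up for illustration. It also assumes that fuzzy queries tolerate only a small number of edits, which as far as I know is at most two in Lucene 4.0 and later.

```java
import java.util.Locale;

// Hypothetical demo class; a simplified model, not Lucene's FuzzyQuery implementation.
final class CaseMismatchDemo {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The value stored by KeywordAnalyzer keeps its original case.
        String indexed = "192:d4e1GDZYDUZrw9AfCB+A66ancCZmx9n2P:2e1GW18A66ac/YP";
        // What the fuzzy term looks like after the parser lowercases it.
        String lowercasedQuery = indexed.toLowerCase(Locale.ROOT);

        System.out.println("case preserved:   " + editDistance(indexed, indexed));
        System.out.println("query lowercased: " + editDistance(indexed, lowercasedQuery));
    }
}
```

With the case preserved the distance is zero. Once the query is lowercased, each uppercase letter in the stored hash costs one substitution, so the distance ends up far beyond what a fuzzy query would accept. Either fix below removes that gap, by leaving the query case alone or by lowercasing both sides.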

I don't know much about what OrientDB exposes of Lucene to allow you to do this effectively, but the two best solutions in Lucene are:

  1. Disable the QueryParser lowercasing these types of queries:

    queryParser.setLowercaseExpandedTerms(false);
    
  2. Use a custom analyzer that combines a KeywordTokenizer with a LowerCaseFilter:

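    // Package names assume Lucene 5.x/6.x; LowerCaseFilter lives elsewhere in some later versions.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;

    public class LowercaseKeywordAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Emit the whole field value as a single token, then lowercase it.
            Tokenizer source = new KeywordTokenizer();
            TokenStream filter = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, filter);
        }
    }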
    

I know neither if nor how these are exposed in OrientDB, but hopefully that points you in the right direction.