1
votes

I am using hibernate-search-3.2.1.Final and would like to parse my input into shingles. From what I can see in the documentation, ShingleAnalyzerWrapper seems to be exactly what I need. I have tested with WhitespaceAnalyzer, StandardAnalyzer, and SnowballAnalyzer as the default analyzer for the ShingleAnalyzerWrapper.

Version luceneVersion = Version.LUCENE_29;
SnowballAnalyzer keywordAnalyzer= new SnowballAnalyzer(luceneVersion, "English", StopAnalyzer.ENGLISH_STOP_WORDS_SET);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(keywordAnalyzer, 4);
shingleAnalyzer.setOutputUnigrams(false);
QueryParser keywordParser = new QueryParser(luceneVersion, "keyword", keywordAnalyzer);
Query keywordQuery = keywordParser.parse(QueryParser.escape(keyword.toLowerCase()));

However, the query came back empty. I was expecting a keyword like "hello world, this is Lucene" to result in the shingles [hello world this is, world this is lucene, this is lucene].

Let me know if my expectation and usage of ShingleAnalyzerWrapper are correct.

Thanks, Ryan

2 Answers

2
votes

Maybe it's a copy/paste error, but in your code snippet, the shingleAnalyzer is not actually being used because you're passing the variable keywordAnalyzer to the query parser. What analyzer are you using at indexing time?

If you use an analyzer that filters out stop words as the delegate analyzer for ShingleAnalyzerWrapper, stop words ("this" and "is" in your example) will be dropped before the shingle analyzer has a chance to create shingles from them.
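To make the effect of stop-word removal concrete, here is a minimal plain-Java sketch. It does not call Lucene: ShingleSketch, removeStopWords and shingles are hypothetical helpers that only approximate what a stop-filtering delegate followed by a shingle filter would do, and they ignore stemming, offsets and position increments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a plain-Java approximation of shingling.
class ShingleSketch {

    // Drops every token found in stopWords, as a stop-filtering delegate analyzer would.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Builds word n-grams of size 2..maxSize starting at each token (no unigrams).
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> result = new ArrayList<String>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder shingle = new StringBuilder(tokens.get(start));
            for (int size = 2; size <= maxSize && start + size <= tokens.size(); size++) {
                shingle.append(' ').append(tokens.get(start + size - 1));
                result.add(shingle.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("hello", "world", "this", "is", "lucene");
        Set<String> stopWords = new HashSet<String>(Arrays.asList("this", "is"));

        // Shingles built from all five tokens.
        System.out.println(shingles(tokens, 4));
        // Shingles built after "this" and "is" have been removed.
        System.out.println(shingles(removeStopWords(tokens, stopWords), 4));
    }
}
```

In this sketch, once "this" and "is" are gone only "hello world", "hello world lucene" and "world lucene" can be formed, so a shingle such as "hello world this is" never appears.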

A good way to debug analyzers is to use something like AnalyzerUtils described in "Lucene in Action". You can get the sample code here: http://java.codefetch.com/example/in/LuceneInAction/src/lia/analysis/AnalyzerUtils.java

Nikita

1
votes

Thanks Nikita! Yes, it was a copy-and-paste error, though the corrected version still does not produce the right results.

Your link on AnalyzerUtils was a great help, as I was able to use the following code to generate shingles:

ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(4);
shingleAnalyzer.setOutputUnigrams(false);

TokenStream stream = shingleAnalyzer.tokenStream("contents", new StringReader("red dress shoes with black laces"));
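ArrayList<Token> tokenList = new ArrayList<Token>();
while (true) {
    Token token = null;
    try {
        // next() returns null once the stream is exhausted
        token = stream.next();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // also stops if next() threw, since token is still null
    if (token == null) break;
    tokenList.add(token);
}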

Which produces:

[(red dress,0,9,type=shingle), (red dress shoes,0,15,type=shingle,posIncr=0), (red dress shoes black,0,26,type=shingle,posIncr=0), (dress shoes,4,15,type=shingle), (dress shoes black,4,26,type=shingle,posIncr=0), (dress shoes black laces,4,32,type=shingle,posIncr=0), (shoes black,10,26,type=shingle), (shoes black laces,10,32,type=shingle,posIncr=0), (black laces,21,32,type=shingle)]
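The offsets in that output can be checked by hand. The following is a rough plain-Java sketch, not Lucene code: ShingleOffsetSketch, Span, tokenize and shingles are hypothetical names, and the sketch assumes that "with" is the only stop word removed and that each shingle spans from the start of its first token to the end of its last token. It leaves out the type and posIncr attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; it does not call Lucene.
class ShingleOffsetSketch {

    // A term plus its character span in the original input.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    // Splits on whitespace, records offsets, and skips stop words.
    static List<Span> tokenize(String input, Set<String> stopWords) {
        List<Span> spans = new ArrayList<Span>();
        Matcher matcher = Pattern.compile("\\S+").matcher(input);
        while (matcher.find()) {
            String term = matcher.group().toLowerCase();
            if (!stopWords.contains(term)) {
                spans.add(new Span(term, matcher.start(), matcher.end()));
            }
        }
        return spans;
    }

    // Joins 2..maxSize consecutive surviving tokens; offsets run from the
    // first token's start to the last token's end.
    static List<String> shingles(List<Span> tokens, int maxSize) {
        List<String> result = new ArrayList<String>();
        for (int first = 0; first < tokens.size(); first++) {
            StringBuilder text = new StringBuilder(tokens.get(first).text);
            for (int size = 2; size <= maxSize && first + size <= tokens.size(); size++) {
                Span last = tokens.get(first + size - 1);
                text.append(' ').append(last.text);
                result.add(new Span(text.toString(), tokens.get(first).start, last.end).toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("with"));
        List<Span> tokens = tokenize("red dress shoes with black laces", stopWords);
        for (String shingle : shingles(tokens, 4)) {
            System.out.println(shingle);
        }
    }
}
```

Under those assumptions the sketch yields the same nine terms and offsets as the list above, which is why "red dress shoes black" covers characters 0 to 26 even though "with" is missing from its text.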

The problem was not with the ShingleAnalyzerWrapper itself, but with the QueryParser. I will need to do some more digging to figure out the underlying cause, but you gave me somewhere to start.