2
votes

I am using the Lucene 3.5 StandardAnalyzer for indexing and searching. It works for all languages except Chinese, Japanese and Korean. I also tried CJKAnalyzer and ChineseAnalyzer, but searching still does not work. The index is created correctly; we verified this with the Luke tool. However, searching for words in these languages returns nothing, both in Luke and from code using the analyzers. Is there a solution for this?

伊拉克航空公司               

+name:伊拉克航空公司~0.9     This is the Lucene query generated by the analyzer for this Chinese word, but it returns no results. Words in other languages and their corresponding queries do return results.
1
Are you using any analyzer at query time? Show some examples of your index and query strings. – user156327
Edited the question with an example. – vishnu

1 Answer

2
votes

For Chinese, there are many useful third-party analyzers, such as:

  1. mmseg4j
  2. IK-analyzer
  3. ansj_seg
  4. imdict-chinese-analyzer

I recommend IK-analyzer. For example, add this to your Maven dependencies:

    <dependency>
        <groupId>com.janeluo</groupId>
        <artifactId>ikanalyzer</artifactId>
        <version>2012_u6</version>
    </dependency>

The example code:

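    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    public class LuenceFirst {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new IKAnalyzer();
            // The field name is irrelevant here; we only inspect the tokens.
            TokenStream tokenStream = analyzer.tokenStream("", "伊拉克航空公司");

            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            tokenStream.reset();
            // Print each token together with its start and end character offsets.
            while (tokenStream.incrementToken()) {
                System.out.println("start→" + offsetAttribute.startOffset());
                System.out.println(charTermAttribute);
                System.out.println("end→" + offsetAttribute.endOffset());
            }
            tokenStream.end();
            tokenStream.close();
        }
    }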

The output is:

    start→0
    伊拉克
    end→3
    start→3
    航空公司
    end→7
    start→3
    航空
    end→5
    start→5
    公司
    end→7
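One likely reason the query in the question finds nothing is that the whole word is looked up as a single term, while the index only holds shorter segmented tokens. As far as I know, the classic query parser does not run fuzzy terms such as 伊拉克航空公司~0.9 through the analyzer. The sketch below is plain Java with no Lucene calls; the class and method names are made up for illustration, it assumes the index holds exactly the tokens printed above, and it ignores fuzzy similarity scoring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermLookupSketch {
    // Assumption: the index contains the tokens IK-analyzer printed for "伊拉克航空公司".
    static final Set<String> INDEXED_TERMS =
            new HashSet<>(Arrays.asList("伊拉克", "航空公司", "航空", "公司"));

    // True only if every query term exists verbatim among the indexed terms.
    static boolean allTermsIndexed(List<String> queryTerms) {
        return INDEXED_TERMS.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // The unsegmented word is not an indexed term.
        System.out.println(allTermsIndexed(Arrays.asList("伊拉克航空公司"))); // false
        // The same word, segmented as at index time, matches.
        System.out.println(allTermsIndexed(Arrays.asList("伊拉克", "航空公司"))); // true
    }
}
```

So the query text should go through the same analyzer that built the index, so that both sides produce the same terms.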

For Japanese:

  1. kuromoji
  2. lucene-gosen