
I searched Google but found nothing about implementing an analyzer in Xapian; it seems Xapian may not support pluggable analyzers the way Lucene does. In other words, I can't extend it to support Chinese. Am I right?

I looked through the Xapian C++ API and only found TermGenerator, which seems to be related to word extraction. It has a flag named FLAG_CJK_NGRAM that splits UTF-8 CJK text: given ABCD, it produces AB, BC, CD and A, B, C, D. That's simple and straightforward, but I think I need a more accurate solution, which suggests I have to implement one myself or port a mature solution (like jieba) to Xapian. Am I right?


1 Answer


The TermGenerator (and the QueryParser, which goes hand in hand with it) supports CJK n-gram splitting via FLAG_CJK_NGRAM, which is possibly what you're looking for. For TermGenerator, you enable it by calling set_flags(); for QueryParser, you pass flags to parse_query() (it's common to bitwise-OR new flags with FLAG_DEFAULT, otherwise you'll turn off features you probably want to keep on).

In all other respects you should be able to use Xapian as normal, such as in the practical example from the "getting started" guide. (Note that although the example is in Python, this will work in other wrapped languages, and directly from C++. The source code for the getting started guide has the code examples in some other languages.)

From the documentation of FLAG_CJK_NGRAM:

With this enabled, spans of CJK characters are split into unigrams and bigrams, with the unigrams carrying positional information. Non-CJK characters are split into words as normal.
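To make the split concrete, here is a small standalone illustration of unigrams and bigrams over a CJK span. It is not Xapian's implementation, just plain Python showing the shape of the output; `cjk_ngrams` is a made-up name.

```python
def cjk_ngrams(span):
    """Return (unigrams, bigrams) for a span of CJK characters."""
    # Every single character is a unigram.
    unigrams = list(span)
    # Every pair of adjacent characters is a bigram.
    bigrams = [span[i:i + 2] for i in range(len(span) - 1)]
    return unigrams, bigrams


unigrams, bigrams = cjk_ngrams("中文搜索")
print(unigrams)  # ['中', '文', '搜', '索']
print(bigrams)   # ['中文', '文搜', '搜索']
```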

If you want to do something else, then you currently have to write your own term generation and query parsing code.