Indexing multilingual words in Lucene

Question

I am trying to index, in Lucene, a field that can contain RDF literals in different languages. Most of the approaches I have seen so far do one of the following:

Use a single index, where each document has one field per language it uses, or
Use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called payloads that allows attaching attributes to terms. Is anyone using this mechanism to store language information (or other attributes, such as datatypes)? How does its performance compare with the two other approaches? Any pointers to source code showing how it is done would help. Thanks.

Xodarap Xodarap · Accepted Answer · 2011-03-10T19:59:50

It depends.

Do you want to allow something like "Search all English text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and show the user which language the match was found in"? If this is what you want, then either payloads or separate fields will work.
An alternative is to index all your text in one field and add a second field that holds the language of the document (assuming each document is in a single language). Then your search would be something like +text:foo +language:english.
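
To make the last option concrete, here is a minimal sketch of what a query like +text:foo +language:english does: both clauses are required, so the result is the intersection of the two posting lists. This is plain Java and deliberately does not use the Lucene API; the class name LanguageFilterSketch, its methods, and the toy tokenizer are all hypothetical and only model the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model (NOT the Lucene API): one tokenized "text" field plus one
// untokenized "language" field per document.
class LanguageFilterSketch {
    // field name -> term -> sorted document ids (a toy posting list)
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes one document and returns its id. Assumes one language per document. */
    public int addDocument(String text, String language) {
        int docId = nextDocId++;
        // Naive tokenizer: split on anything that is not a Unicode letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                post("text", token, docId);
            }
        }
        // The language is stored as a single term, not tokenized.
        post("language", language.toLowerCase(Locale.ROOT), docId);
        return docId;
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    private SortedSet<Integer> lookup(String field, String term) {
        Map<String, SortedSet<Integer>> terms = postings.get(field);
        if (terms == null) {
            return new TreeSet<Integer>();
        }
        SortedSet<Integer> docs = terms.get(term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    /** Models +text:term +language:language, where both clauses must match. */
    public List<Integer> search(String term, String language) {
        // Copy the first posting list so the index itself is not modified.
        SortedSet<Integer> hits = new TreeSet<>(lookup("text", term.toLowerCase(Locale.ROOT)));
        hits.retainAll(lookup("language", language.toLowerCase(Locale.ROOT)));
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        LanguageFilterSketch index = new LanguageFilterSketch();
        index.addDocument("foo bar", "english");      // doc 0
        index.addDocument("foo baz", "french");       // doc 1
        index.addDocument("nothing here", "english"); // doc 2
        System.out.println(index.search("foo", "english")); // prints [0]
    }
}
```

Note that the language is stored once per document here, not once per term, which is the storage difference from the payload approach.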

In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
