0
votes

I am trying to index in Lucene a field that could have RDF literal in different languages. Most of the approaches I have seen so far are:

  • Use a single index, where each document has a field per each language it uses, or

  • Use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called Payload that allows to attach attributes to term. Is anyone use this mechanism to store language (or other attributes such as datatypes) information ? How is performance compared to the two other approaches ? Any pointer on source code showing how it is done would help. Thanks.

2

2 Answers

0
votes

It depends.

  1. Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
  2. Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
  3. An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.

In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).

0
votes

so basically lucene is a ranking algorithm, it just looks at strings and compares them to other string. they can be encoded in different character encodings but their similarity is the same non the less. Just make sure you load the SnowBallAnalyzer with the supported langugage stemmer and you should get results. Like say Spanish or Chinese