Lucene multi-language analyzer/index approach

Question

I have a working Lucene index supporting a suggestion service. When a user types into a search box, it queries the index by the SUGGESTION_FIELD. Each entry in SUGGESTION_FIELD can be in one of many supported languages, and each is indexed using an appropriate language-specific analyzer. To record which analyzer was used, there is a second field per entry which stores the LOCALE. So at query time I can do something like the code below to run a language-specific query using the appropriate analyzer:

QueryParser parser = new QueryParser(Version.LUCENE_33, SUGGESTION_FIELD, getLangaugeAnalyzer(locale));
return searcher.search(parser.parse("SUGGESTION_FIELD:" + queryString + " AND LOCALE:"
                + locale), 100);

This works. But now the client wants to be able to search using multiple languages at once.

My question: what would be the fastest querying solution, bearing in mind that a suggestion service needs to be very fast?

Sol. #1. The simplest solution would seem to be to run the query multiple times, once for each locale, applying the corresponding language analyser each time, and then merge the results from each query in some sensible fashion.

Sol. #2. Alternatively I could re-index using a column for each locale such that:

SUGGESTION_FIELD_en, SUGGESTION_FIELD_fr, SUGGESTION_FIELD_es etc..

using a different analyzer for each field (using PerFieldAnalyzerWrapper) and then query using a more complex query string such that:

"SUGGESTION_FIELD_en:" + queryString + " AND SUGGESTION_FIELD_fr:" + queryString + " AND SUGGESTION_FIELD_es:" + queryString

Please help if you think you can :)

From what you've specified it doesn't seem like a lot of work -- why not just OR together all the locales you need? — Marko Topolnik
Thanks. Because I'm not convinced that it will run quicker when the index scales up. You're right though, it's not a lot of work. Unfortunately I'm working from a test database with not a lot of data, so I can't be sure which will fare better when the index eventually gets very large in the production environment. So I'm interested in people's opinions :) — Rob McFeely
Your query is going to be something like this: sugField:queryString AND (locale:loc1 OR locale:loc2 OR ...). This is a BooleanQuery composed of TermQueries, with one of the terms mandatory. This term is also rare in the index and Lucene knows this at the outset (it checks the total doc count for each Term given) so it will know to first constrain the result by the queryString and then additionally intersect that with the locale terms. This will be VERY efficient no matter how large your index. — Marko Topolnik
@Marko Topolnik Thanks for your response. Unfortunately this approach (of a single query) won't work, as I need to choose a language analyser when querying the sugField. So while the boolean sugg AND (loc OR loc OR loc ...) is querying the correct parts of the index efficiently, it is applying the same blanket analyser to each part, which is not what I want. This is why my sol. #1 above used multiple calls (one per locale), so that I could apply the corresponding analyser each time — Rob McFeely

Marko Topolnik · Accepted Answer · 2012-04-04T13:02:07

Your query is going to be something like this: (sugField:queryString1 AND locale:loc1) OR (sugField:queryString2 AND locale:loc2) OR .... This is a top-level BooleanQuery with subordinate BooleanQueries added with occurs=SHOULD, where each subordinate query has its terms with occurs=MUST. The queryString1, queryString2, etc. are the outputs from different language analyzers having the same input, the string the user entered.

Each subordinate query involves mandatory terms (from your query string) that are rare in the index and Lucene knows this at the outset (it knows the total doc count for each Term in the index) so it will first constrain the result by the queryString and then additionally intersect that with the locale terms. This will be VERY efficient no matter how large your index.

As for the different analyzers, I suggest you don't use the QueryParser, but create the entire query programmatically. This is good general advice whenever you don't enter the query by hand, and in your case it is the only way to gain control of the analyzing aspect. Run your query string through each of the language-specific analyzers and add their output tokens as TermQueries to the subordinate BooleanQueries.
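
The structure described in this answer can be sketched without any Lucene dependency. The following is only a model of the query shape, not Lucene code: the TermQuery, BooleanQuery and Occur types below are small stand-ins that mimic the Lucene classes of the same names, and the two "analyzers" are hypothetical toys (lowercasing plus one stop word each) standing in for real language-specific analyzers. The field names sugField and locale are the ones used above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class MultiLocaleQuerySketch {

    enum Occur { MUST, SHOULD }

    /** Stand-in query node; these are NOT the Lucene classes, they only mirror their shape. */
    interface Query {}

    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) { this.field = field; this.text = text; }
        @Override public String toString() { return field + ":" + text; }
    }

    static final class BooleanQuery implements Query {
        private final List<Query> queries = new ArrayList<>();
        private final List<Occur> occurs = new ArrayList<>();

        void add(Query query, Occur occur) { queries.add(query); occurs.add(occur); }

        /** Renders MUST clauses with a leading '+', nested boolean queries in parentheses. */
        @Override public String toString() {
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < queries.size(); i++) {
                Query q = queries.get(i);
                String s = (q instanceof BooleanQuery) ? "(" + q + ")" : q.toString();
                parts.add((occurs.get(i) == Occur.MUST ? "+" : "") + s);
            }
            return String.join(" ", parts);
        }
    }

    /** One sub-query per locale: every analyzed term MUST match, and so MUST the locale. */
    static Query build(String userInput, Map<String, Function<String, List<String>>> analyzers) {
        BooleanQuery top = new BooleanQuery();
        for (Map.Entry<String, Function<String, List<String>>> e : analyzers.entrySet()) {
            List<String> terms = e.getValue().apply(userInput);
            if (terms.isEmpty()) continue; // the analyzer removed everything; skip this locale
            BooleanQuery sub = new BooleanQuery();
            for (String term : terms) {
                sub.add(new TermQuery("sugField", term), Occur.MUST);
            }
            sub.add(new TermQuery("locale", e.getKey()), Occur.MUST);
            top.add(sub, Occur.SHOULD); // any one locale matching is enough
        }
        return top;
    }

    /** Hypothetical toy analyzers: lowercase, split on whitespace, drop one stop word. */
    static Map<String, Function<String, List<String>>> toyAnalyzers() {
        Map<String, Function<String, List<String>>> m = new LinkedHashMap<>();
        m.put("en", input -> tokens(input, "the"));
        m.put("fr", input -> tokens(input, "le"));
        return m;
    }

    static List<String> tokens(String input, String stopWord) {
        List<String> out = new ArrayList<>();
        for (String t : input.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty() && !t.equals(stopWord)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(build("The Louvre", toyAnalyzers()));
    }
}
```

For the input "The Louvre" this prints (+sugField:louvre +locale:en) (+sugField:the +sugField:louvre +locale:fr): the same user input yields different terms per locale, and each set is tied to its own locale term. In a real index the token lists would come from each language analyzer's token stream rather than from the toy tokens helper, and the nodes would be actual Lucene query objects.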
