Lucene - Querying multiple terms in a field

Question

For simplicity's sake, consider two documents with the following fields and values:

RecordId: "12345"
CreatedAt: "27/02/1992"
Event: "Manchester, Dubai, Paris"
Event: "Manchester, Rome, Madrid"
Event: "Madrid, Sidney"


RecordId: "99999"
CreatedAt: "27/02/1992"
Event: "Manchester, Barcelona, Rome"
Event: "Rome, Paris"
Event: "Milan, Barcelona"

Is it possible to search for multiple terms within a single instance of the "Event" field?

Let's say I want to search for "Manchester" and "Paris" appearing in the same field instance. The second record contains both "Manchester" and "Paris", but in different instances of the Event field, so it should not be part of the result set.

Ideally, the result set would contain only the first record (12345).

Hey, Pelican. Perhaps index each record (RecordId) once for each Event field, adding a suffix to the RecordId for each one. In your example you would then have six indexed documents: 12345-1, 12345-2, 12345-3, etc. You would end up with a much bigger index and you would need to filter out duplicate hits (if, for example, you also had a "Manchester, Detroit, Paris" Event), but I think it would work. — Michael Gorsich
I see your point, but that approach in the long run would eventually give me nightmares. Nevertheless, it would work. — pelican_george
Yeah, I didn't make it a formal answer because it seems kludgy, even though it would work. If you go with that approach, please let me know. — Michael Gorsich
@MichaelGorsich Just to follow up on your comment: how would you search those entries at runtime without knowing their names in advance (e.g. 12345-1, 12345-2, 12345-3, etc.)? — pelican_george
In your example, plus the one in my first comment, the results for "Manchester" and "Paris" will get you 12345-1 and 12345-4. You initially accumulate all results, then you lop the suffixes off (LastIndexOf()) and eliminate duplicates to reduce the results to 12345, so you end up with a single result, which you use to retrieve your document. — Michael Gorsich
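
The accumulate, strip and de-duplicate step described in the last comment could be sketched roughly as follows in Python. The `-N` suffix convention comes from the comments; the helper name `collapse_hits` and the sample hit list are illustrative assumptions, and `str.rpartition` stands in for the `LastIndexOf()` call mentioned above.

```python
def collapse_hits(hit_ids):
    """Strip the per-Event suffix from each hit ID and drop duplicates,
    keeping the order in which record IDs were first seen."""
    seen = {}
    for hit_id in hit_ids:
        # Split on the LAST "-" so record IDs containing "-" would still work.
        record_id, sep, _suffix = hit_id.rpartition("-")
        if not sep:
            # No suffix present: keep the ID unchanged.
            record_id = hit_id
        seen.setdefault(record_id, None)
    return list(seen)


# Hypothetical hits for "Manchester" AND "Paris" in the suffixed index.
hits = ["12345-1", "12345-4"]
print(collapse_hits(hits))  # ['12345']
```

The resulting record IDs would then be used to fetch the original documents.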

AndyPook · Accepted Answer · 2016-03-20T16:47:04

Depending on the analyser you use for the field (it would need to tokenise and remove the punctuation), you could use a phrase query with slop.

"manchester paris"~2 should find just 12345. Depending on the number and order of values in each field you may need to use a larger slop.

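The slop is the maximum number of "operations" (single-position moves of a term) allowed for the phrase to still match. Moves are needed to skip over additional terms within the phrase or to reorder its terms; in Lucene, swapping two adjacent terms costs two moves.

So "x y"~2 could match

"y x"
"x fred y"
but not "y fred x" (that would require three moves: two for the swap plus one for the additional term)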
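
To make the counting concrete, here is a small Python sketch of the slop arithmetic. It is a simplified model and not Lucene's code: the regex tokeniser is only a stand-in for an analyser that lowercases and strips punctuation, and `tokenize` and `min_slop` are hypothetical helper names. It follows the rule from Lucene's `PhraseQuery` documentation that switching the order of two words takes two moves.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def min_slop(phrase, text):
    """Smallest slop at which `phrase` would match `text` in this model,
    or None if some phrase term does not occur in the text at all."""
    query = tokenize(phrase)
    tokens = tokenize(text)
    # All document positions of each query term.
    positions = [[i for i, t in enumerate(tokens) if t == q] for q in query]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Shift of each term relative to its place in the phrase; the
        # required slop is the spread between largest and smallest shift.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best


for text in ("x y", "x fred y", "y x", "y fred x"):
    print(repr(text), "needs slop", min_slop("x y", text))
# 'x y' needs slop 0
# 'x fred y' needs slop 1
# 'y x' needs slop 2
# 'y fred x' needs slop 3
```

Under the same model, `min_slop("manchester paris", "Manchester, Dubai, Paris")` is 1, so a query slop of 2 is sufficient for that field value.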

For your needs, the slop probably ought to be equal to the maximum number of terms allowed in a field. I haven't worked it through, but I think that would suffice even if you query for more than two terms.
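
One thing worth checking before picking a large slop: Lucene indexes all values of a multi-valued field into one position stream, with the analyser's position increment gap between consecutive values. The Python sketch below models that with a hypothetical `gap` parameter, to show how slop and gap interact for the two records in the question. The function names, the regex tokeniser and the gap values 0 and 100 are assumptions for illustration, not settings taken from the question.

```python
import re
from itertools import product


def index_positions(values, gap):
    """Map each token to its positions across all values of one field."""
    positions = {}
    pos = 0
    for value in values:
        for token in (t for t in re.split(r"[^0-9a-z]+", value.lower()) if t):
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # assumed position increment gap between field values
    return positions


def matches(values, phrase_terms, slop, gap):
    """True if the phrase matches within `slop` moves in this model."""
    positions = index_positions(values, gap)
    candidates = [positions.get(term, []) for term in phrase_terms]
    if not all(candidates):
        return False  # at least one term is missing from the field
    for combo in product(*candidates):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


record_12345 = ["Manchester, Dubai, Paris", "Manchester, Rome, Madrid", "Madrid, Sidney"]
record_99999 = ["Manchester, Barcelona, Rome", "Rome, Paris", "Milan, Barcelona"]
query = ["manchester", "paris"]

print(matches(record_12345, query, slop=3, gap=100))  # True
print(matches(record_99999, query, slop=3, gap=100))  # False
print(matches(record_99999, query, slop=3, gap=0))    # True
```

In this model, a slop of 3 with no gap lets "manchester" in the first Event value pair up with "paris" in the second, so record 99999 would match as well. Keeping the gap larger than the slop is what confines a match to a single Event value.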
