How do you tokenize the fields? Do you store them as complete strings? Also, how do you parse the query?
Okay, so I am playing around a bit with this. I have been using a StopFilter to remove la, en and de, and then a ShingleFilter to produce multi-word combinations alongside the single tokens (it emits the unigrams by default), so that you can get "exact matches" on parts of the query. For example, Bosc de Planavilla is tokenized as [Bosc] [Planavilla] [Bosc Planavilla], and Bosc de Plana en Blanca is tokenized as [Bosc] [Plana] [Blanca] [Bosc Plana] [Plana Blanca] [Bosc Plana Blanca].
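If you want to see exactly which terms end up in the index, you can run the analyzer by hand and print every token it emits. A minimal sketch using the createAnalyzer factory from the test below (the class name PrintTokens and the sample string are just for illustration):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        // Same configuration as the index side of the test: shingles of up to 3 tokens.
        Analyzer analyzer = ShingleFilterTests.createAnalyzer(3);
        TokenStream stream = analyzer.tokenStream("city", new StringReader("Bosc de Plana en Blanca"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term());
        }
        stream.close();
    }
}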
I then query with the exact string the user passed, although there is room for adaptation there as well. I went with the simple case so that the results better match what you were looking for.
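On the query side, one way to handle arbitrary user input is to escape it and wrap it in quotes so the whole string is parsed as a phrase against the city field. Just a sketch of that idea (the class and method names are mine, not part of the test):

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class CityQueries {
    // Builds a phrase query on "city" from raw user input, escaping characters
    // that are special to the query syntax. Uses the shingle-free analyzer for parsing.
    public static Query exactCityQuery(String userInput) throws ParseException {
        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", ShingleFilterTests.createAnalyzer(0));
        return qp.parse("\"" + QueryParser.escape(userInput) + "\"");
    }
}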
Here is the code I am using (Lucene 3.0.3):
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

public class ShingleFilterTests {

    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;

    // Whitespace tokenization, stop-word removal (de, la, en), then shingles of up
    // to `shingles` tokens. Pass 0 to skip the shingle stage (used at query time).
    public static Analyzer createAnalyzer(final int shingles) {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla",
                "Bosc de la Planassa", "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");
        // Parse queries with the shingle-free analyzer, so the user's terms are
        // only whitespace-tokenized and stop-word filtered.
        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
        printSearch(qp, "city:\"Bosc de Planavilla\"");
        printSearch(qp, "city:Planavilla");
        printSearch(qp, "city:Bosc");
    }

    private void printSearch(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, 4);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}
I am now looking into sorting by population; a sketch of one possible approach follows the output below.
The test above prints out:
query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000
query city:Planavilla
results 2
1. doc=1 score=1.287682 "Planavilla" population: 20000
2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000
query city:Bosc
results 3
1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000
2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000
3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000
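For the population sorting, I am leaning towards Lucene's Sort/SortField: since "population" is indexed NOT_ANALYZED, the FieldCache can parse its values as ints at search time. A rough sketch of a method that could be dropped into the test class above (the method name is mine, and it additionally needs org.apache.lucene.search.Sort and org.apache.lucene.search.SortField imported); I have not wired this into the test yet:

    // Like printSearch, but orders hits by population (largest first) instead of by score.
    // SortField.INT makes the FieldCache parse the indexed "population" values as ints;
    // the final boolean argument reverses the natural ascending order.
    private void printSearchByPopulation(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        Sort byPopulation = new Sort(new SortField("population", SortField.INT, true));
        TopDocs hits = searcher.search(q, null, 4, byPopulation);
        System.out.println("query " + q + " sorted by population");
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println("\"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }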