How do you tokenize the fields? Do you store them as complete strings? Also, how do you parse the query?
Okay, so I am playing around a bit with this. I have been using a StopFilter to remove la, en and de, and then a ShingleFilter to produce multi-word combinations alongside the single tokens (it emits the unigrams by default), so that you can get "exact matches" on parts of the query. For example, Bosc de Planavilla is tokenized as [Bosc] [Planavilla] [Bosc Planavilla], and Bosc de Plana en Blanca is tokenized as [Bosc] [Plana] [Blanca] [Bosc Plana] [Plana Blanca] [Bosc Plana Blanca].
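If you want to see exactly which terms end up in the index, you can run the analyzer by hand and print every token it emits. A minimal sketch using the createAnalyzer factory from the test below (the class name PrintTokens and the sample string are just for illustration):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        // Same configuration as the index side of the test: shingles of up to 3 tokens.
        Analyzer analyzer = ShingleFilterTests.createAnalyzer(3);
        TokenStream stream = analyzer.tokenStream("city", new StringReader("Bosc de Plana en Blanca"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term());
        }
        stream.close();
    }
}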
I then query with the exact string the user passed, although there is room for adaptation there as well. I went with the simple case so that the results better match what you were looking for.
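On the query side, one way to handle arbitrary user input is to escape it and wrap it in quotes so the whole string is parsed as a phrase against the city field. Just a sketch of that idea (the class and method names are mine, not part of the test):

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class CityQueries {
    // Builds a phrase query on "city" from raw user input, escaping characters
    // that are special to the query syntax. Uses the shingle-free analyzer for parsing.
    public static Query exactCityQuery(String userInput) throws ParseException {
        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", ShingleFilterTests.createAnalyzer(0));
        return qp.parse("\"" + QueryParser.escape(userInput) + "\"");
    }
}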
Here is the code I am using (Lucene 3.0.3):
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

public class ShingleFilterTests {

    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;

    // Whitespace tokenization, stop-word removal (de, la, en), then shingles of up
    // to `shingles` tokens. Pass 0 to skip the shingle stage (used at query time).
    public static Analyzer createAnalyzer(final int shingles) {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla",
                "Bosc de la Planassa", "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");
        // Parse queries with the shingle-free analyzer, so the user's terms are
        // only whitespace-tokenized and stop-word filtered.
        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
        printSearch(qp, "city:\"Bosc de Planavilla\"");
        printSearch(qp, "city:Planavilla");
        printSearch(qp, "city:Bosc");
    }

    private void printSearch(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, 4);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}
I am now looking into sorting by population; a sketch of one possible approach follows the output below.
The test above prints out:
query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000
query city:Planavilla
results 2
1. doc=1 score=1.287682 "Planavilla" population: 20000
2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000
query city:Bosc
results 3
1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000
2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000
3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000
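For the population sorting, I am leaning towards Lucene's Sort/SortField: since "population" is indexed NOT_ANALYZED, the FieldCache can parse its values as ints at search time. A rough sketch of a method that could be dropped into the test class above (the method name is mine, and it additionally needs org.apache.lucene.search.Sort and org.apache.lucene.search.SortField imported); I have not wired this into the test yet:

    // Like printSearch, but orders hits by population (largest first) instead of by score.
    // SortField.INT makes the FieldCache parse the indexed "population" values as ints;
    // the final boolean argument reverses the natural ascending order.
    private void printSearchByPopulation(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        Sort byPopulation = new Sort(new SortField("population", SortField.INT, true));
        TopDocs hits = searcher.search(q, null, 4, byPopulation);
        System.out.println("query " + q + " sorted by population");
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println("\"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }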