I'm new to Lucene. I have two documents and I would like to have an exact match for the document field called "keyword" (the field may occur multiple times within a document).

The first document contains the keyword "Annotation is cool". The second document contains the keyword "Annotation is cool too". How do I build the query so that only the first document is found when I search for "Annotation is cool"?

I read that a "StringField" is not tokenized. But if I change the "keyword" field from "TextField" to "StringField" in the method "addDoc", nothing is found.

Here is my code:

private IndexWriter writer;

public void lucene() throws IOException, ParseException {
    // Build the index
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_42,
            analyzer);
    this.writer = new IndexWriter(index, config);

    // Add documents to the index
    addDoc("Spring", new String[] { "Java", "JSP",
            "Annotation is cool" });
    addDoc("Java", new String[] { "Oracle", "Annotation is cool too" });

    writer.close();

    // Search the index
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);

    BooleanQuery qry = new BooleanQuery();

    qry.add(new TermQuery(new Term("keyword", "\"Annotation is cool\"")), BooleanClause.Occur.MUST);

    System.out.println(qry.toString());

    Query q = new QueryParser(Version.LUCENE_42, "title", analyzer).parse(qry.toString());

    int hitsPerPage = 10;
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);

    searcher.search(q, collector);

    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document doc = searcher.doc(docId);
        System.out.println((i + 1) + ". \t" + doc.get("title"));
    }

    reader.close();
}

private void addDoc(String title, String[] keywords) throws IOException {
    // Create new document
    Document doc = new Document();

    // Add title
    doc.add(new TextField("title", title, Field.Store.YES));

    // Add keywords
    for (int i = 0; i < keywords.length; i++) {
        doc.add(new TextField("keyword", keywords[i], Field.Store.YES));
    }

    // Add document to index
    this.writer.addDocument(doc);
}

1 Answer

Your problem is not in how you are indexing the field. A StringField is the correct way to index the entire input as a single token. The problem is how you are searching; I don't see what this logic is meant to accomplish.

BooleanQuery qry = new BooleanQuery();
qry.add(new TermQuery(new Term("keyword", "\"Annotation is cool\"")), BooleanClause.Occur.MUST);
// Great! You have a TermQuery added to the parent BooleanQuery, which should find your keyword just fine
// (as long as the term text equals the indexed value exactly, i.e. without the embedded quotes).

Query q = new QueryParser(Version.LUCENE_42, "title", analyzer).parse(qry.toString());
//Now all bets are off.

Query.toString() is handy for debugging, but it is not safe to assume that running its output through a QueryParser will regenerate the same query. The standard query parser has little ability to express multiple words as a single term. The string form that you see will, I believe, look like:

keyword:"Annotation is cool"

The parser will interpret that as a PhraseQuery. A PhraseQuery looks for three consecutive terms (Annotation, is, and cool), but the way you have indexed this, there is only a single term, "Annotation is cool".
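To make the mismatch concrete, here is a toy model in plain Java. It is not Lucene code: the class `TermModel` and its methods are hypothetical, and the "analyzer" merely lowercases and splits on whitespace, which is only a crude stand-in for what a real analyzer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TermModel {

    // Untokenized indexing (the StringField behavior): the whole value is one term.
    public static List<String> indexUntokenized(String value) {
        return Collections.singletonList(value);
    }

    // Crude stand-in for an analyzer (assumption): lowercase, split on whitespace.
    public static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Term lookup: matches only if exactly this term was indexed.
    public static boolean termMatches(List<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    // Phrase lookup: the analyzed tokens must appear consecutively in the index.
    public static boolean phraseMatches(List<String> indexedTerms, String phrase) {
        return Collections.indexOfSubList(indexedTerms, analyze(phrase)) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = indexUntokenized("Annotation is cool");
        System.out.println(termMatches(indexed, "Annotation is cool"));     // true
        System.out.println(termMatches(indexed, "\"Annotation is cool\"")); // false
        System.out.println(phraseMatches(indexed, "Annotation is cool"));   // false
    }
}
```

In this model the phrase lookup fails because the index holds one three-word term rather than three one-word terms, and the term lookup fails as soon as the search text differs from the indexed value by even a pair of quote characters.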

The solution is to never use logic like

 Query nuttyQuery = queryParser.parse(perfectlyGoodQuery.toString());
 searcher.search(nuttyQuery, collector);

Instead, just search with the BooleanQuery you already created.

 searcher.search(perfectlyGoodQuery, collector);
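
Putting the pieces together, a minimal sketch against Lucene 4.2 (the version used in the question) might look like the following. It assumes the "keyword" field is indexed as a StringField and that the term text is passed without embedded quotes. The class name `ExactKeywordSearch` and the method `titlesWithKeyword` are made up for illustration, and the bare TermQuery is used directly because a BooleanQuery with a single MUST clause adds nothing here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ExactKeywordSearch {

    // Hypothetical helper: indexes the two sample documents from the question
    // and returns the titles of documents having exactly this keyword.
    public static List<String> titlesWithKeyword(String keyword) throws IOException {
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(index, config);
        addDoc(writer, "Spring", "Java", "JSP", "Annotation is cool");
        addDoc(writer, "Java", "Oracle", "Annotation is cool too");
        writer.close();

        IndexReader reader = DirectoryReader.open(index);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The term text is the raw keyword: no quotes, no QueryParser round trip.
            TermQuery query = new TermQuery(new Term("keyword", keyword));
            List<String> titles = new ArrayList<String>();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                titles.add(searcher.doc(hit.doc).get("title"));
            }
            return titles;
        } finally {
            reader.close();
        }
    }

    private static void addDoc(IndexWriter writer, String title, String... keywords)
            throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        for (String keyword : keywords) {
            // StringField is not analyzed: the whole value becomes a single term.
            doc.add(new StringField("keyword", keyword, Field.Store.YES));
        }
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(titlesWithKeyword("Annotation is cool"));
    }
}
```

With the two sample documents, searching for "Annotation is cool" should return only the "Spring" document, because "Annotation is cool too" is a different term. Note that a TermQuery on a StringField is case-sensitive, since no analyzer touches the value.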