1
votes

I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.

I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene. Example source

Here is my (updated) code that generates a Document object from a File object:

public static Document getDocument(File f) throws FileNotFoundException, IOException {
    Document d = new Document();

    //adding a field
    FieldType contentType = new FieldType();
    contentType.setStored(true);
    contentType.setTokenized(true);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.setStoreTermVectors(true);

    String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    d.add(new Field("content", fileContents, contentType));

    //adding other fields, then...

    //the boost coefficient (updated):
    double coef = 1.0 + ranks.get(path);
    d.add(new DoubleDocValuesField("boost", coef));

    return d;

}

The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don't want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.


Edit:

After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching as advised by @EricLavault. However, now all my documents have a score of exactly their boost, regardless of the query! How do I fix that? Here is my searching function:

public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
    try {
        Query q_temp = buildQuery(query); //the original query, was working fine alone

        Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
        q = q.rewrite(DirectoryReader.open(bm25IndexDir));
        TopDocs results = searcher.search(q, 10);

        ScoreDoc[] filterScoreDosArray = results.scoreDocs;
        for (int i = 0; i < filterScoreDosArray.length; ++i) {
            int docId = filterScoreDosArray[i].doc;
            Document d = searcher.doc(docId);

            //here, when printing, I see that the document's score is the same as its "boost" value. WHY??
            System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
        }

        return results;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

//function that builds the query, working fine
public static Query buildQuery(String query) {
    try {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
          CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
          builder.add(new Term("content", charTermAttribute.toString()));
        }

        tokenStream.end(); tokenStream.close();
        builder.setSlop(1000);
        PhraseQuery q = builder.build();

        return q;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}
2

2 Answers

1
votes

Starting from Lucene 6.5.0 :

Index-time boosts are deprecated. As a replacement, index-time scoring factors should be indexed into a doc value field and combined at query time using eg. FunctionScoreQuery. (Adrien Grand)

The recommendation instead of using index time boost would be to encode scoring factors (ie. length normalization factors) into doc values fields instead. (cf. LUCENE-6819)

1
votes

Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):

A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score.

So, when does it replace, and when does it modify?

Turns out, the code I was using is for entirely replacing the score by the boost value:

Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query

What I needed to do instead was using the function boostByValue, that modifies the searching score (by multiplying the score by the boost value):

Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));

And now it works! Thanks @EricLavault for the help!