1
votes

I'm trying to inject latent Dirichlet allocation (LDA) into the relevance scoring of search documents, and I'm stuck. I have only just started with Lucene and am using code from "Lucene in Action" as a starting point.

The plan is to try a weighted mixture of the default tf-idf score and the cosine similarity between the topic vectors of the query and each document, e.g. 0.5 * tfidf + 0.5 * cos(Q,D).
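To make the formula concrete, here is a minimal sketch of the two pieces in plain Java, independent of Lucene. The class name `TopicSimilarity` and the `weight` parameter are assumptions introduced for illustration; the 0.5/0.5 split above corresponds to `weight = 0.5`.

```java
/** Hypothetical helper for illustration; not part of Lucene. */
final class TopicSimilarity {

  /** Cosine similarity between two topic vectors of equal length. */
  static double cosine(double[] q, double[] d) {
    if (q.length != d.length) {
      throw new IllegalArgumentException("topic vectors differ in length");
    }
    double dot = 0.0, normQ = 0.0, normD = 0.0;
    for (int i = 0; i < q.length; i++) {
      dot += q[i] * d[i];
      normQ += q[i] * q[i];
      normD += d[i] * d[i];
    }
    if (normQ == 0.0 || normD == 0.0) {
      return 0.0; // an all-zero vector has no direction
    }
    return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
  }

  /** Linear mixture: weight * tfidf + (1 - weight) * cos(Q, D). */
  static double mix(double tfidf, double cos, double weight) {
    return weight * tfidf + (1.0 - weight) * cos;
  }
}
```

Note that Lucene's tf-idf scores are not confined to [0, 1] while the cosine of two non-negative topic vectors is, so the two terms may need rescaling before an even split is meaningful.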

I have tried storing the topic vector for each document during indexing, using a delimiter between the topic scores:

doc.add(new Field("lda score", "0.200|0.111|0.4999",
                  Field.Store.NO,
                  Field.Index.NOT_ANALYZED_NO_NORMS));
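For reference, a delimited string like the one above can be turned back into a numeric vector with a few lines of plain Java. This is only a sketch: `TopicVectorCodec` is a hypothetical name, and how the string is fetched back at search time is left out (with Field.Store.NO it is not available as a stored field).

```java
/** Hypothetical encoder/decoder for '|'-delimited topic vectors. */
final class TopicVectorCodec {

  /** Parses e.g. "0.200|0.111|0.4999" into {0.2f, 0.111f, 0.4999f}. */
  static float[] parse(String encoded) {
    // '|' is a regex metacharacter, so it has to be escaped for split().
    String[] parts = encoded.split("\\|");
    float[] vector = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Float.parseFloat(parts[i].trim());
    }
    return vector;
  }

  /** Joins the topic scores with '|' so that parse() can read them back. */
  static String format(float[] vector) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < vector.length; i++) {
      if (i > 0) {
        sb.append('|');
      }
      sb.append(vector[i]);
    }
    return sb.toString();
  }
}
```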

Then during searching:

//tfidf 
Query q = new QueryParser(Version.LUCENE_30,
                          "content",
                          new StandardAnalyzer(
                            Version.LUCENE_30))
             .parse("some text here");
FieldScoreQuery qf = new FieldScoreQuery("lda score",
                                         FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, qf) {
  public CustomScoreProvider getCustomScoreProvider(IndexReader r) {
    return new CustomScoreProvider(r) {
      public float customScore(int doc,
                               float tfidfScore,
                               float ldaScore) {
        // float literals: a double result cannot be returned as float
        return 0.5f * tfidfScore + 0.5f * ldaScore;
      }
    };
  }
};

Obviously, it is the FieldScoreQuery portion that I need help with. How do I read in the query string, run LDA inference (a separate analysis outside Lucene), and compute the cosine similarity to produce scores for the CustomScoreQuery to consume?

Is this the correct way to do this, or do I need to go into the Similarity classes? Some code samples to help me get started would be appreciated.

1 Answer

0
votes

As far as I know, a FieldScoreQuery cannot read a delimited string; it expects one numeric value per field. If you need 3 values, use 3 fields and 3 distinct FieldScoreQuery instances of type FLOAT.

I use NumericFields:

luc_doc.add(new NumericField( FIELD_NAME,Field.Store.NO,true).setFloatValue( x ));

Then, in the CustomScoreProvider, override the method

public float customScore(int doc, float subQueryScore, float[] valSrcScores)

where the valSrcScores array will hold your 3 values, one per FieldScoreQuery passed to the CustomScoreQuery.
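As a rough sketch of what the body of that method could do, the class below mirrors the customScore signature in plain Java (no Lucene dependency, so it can be tried in isolation). `LdaScoreCombiner` and its `combine` method are hypothetical names, and the query topic vector is assumed to have been inferred once per query by the separate LDA step.

```java
/** Hypothetical stand-in for the logic inside customScore(...). */
final class LdaScoreCombiner {
  private final float[] queryTopics; // inferred outside Lucene, once per query
  private final float queryNorm;     // cached so it is not recomputed per document

  LdaScoreCombiner(float[] queryTopics) {
    this.queryTopics = queryTopics.clone();
    float sum = 0f;
    for (float t : queryTopics) {
      sum += t * t;
    }
    this.queryNorm = (float) Math.sqrt(sum);
  }

  /** subQueryScore is the tf-idf score; valSrcScores is the document's topic vector. */
  float combine(float subQueryScore, float[] valSrcScores) {
    if (valSrcScores.length != queryTopics.length) {
      throw new IllegalArgumentException("expected " + queryTopics.length + " topic scores");
    }
    float dot = 0f, docSum = 0f;
    for (int i = 0; i < valSrcScores.length; i++) {
      dot += queryTopics[i] * valSrcScores[i];
      docSum += valSrcScores[i] * valSrcScores[i];
    }
    float docNorm = (float) Math.sqrt(docSum);
    float cos = (queryNorm == 0f || docNorm == 0f) ? 0f : dot / (queryNorm * docNorm);
    return 0.5f * subQueryScore + 0.5f * cos;
  }
}
```

Inside a real CustomScoreProvider, the overridden customScore(doc, subQueryScore, valSrcScores) would simply delegate to something like combine(subQueryScore, valSrcScores).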