I'm trying to inject latent Dirichlet allocation (LDA) into scoring the relevance of search documents, and I've gotten stuck. I've only just started with Lucene, and I'm using code from "Lucene in Action" to get going.
The plan is to try a weighted mixture of the default tf-idf score and the cosine similarity between the topic vectors of the query and each document, e.g. 0.5 * tfidf + 0.5 * cos(Q, D).
I have tried storing the topic vector for each document at indexing time, with a delimiter between the per-topic scores:
doc.add(new Field("lda score", "0.200|0.111|0.4999",
                  Field.Store.NO,
                  Field.Index.NOT_ANALYZED_NO_NORMS));
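(A side note on this format: since I use Field.Store.NO, I gather the raw string cannot be loaded back via IndexReader.document(); it would need Field.Store.YES, or a FieldCache lookup on the NOT_ANALYZED term, to be readable at search time. Either way, decoding it is Lucene-independent; a minimal sketch, with class and method names of my own choosing:)

```java
public class LdaVector {
    // Decode a "0.200|0.111|0.4999"-style stored value back into a float vector.
    public static float[] parseVector(String stored) {
        String[] parts = stored.split("\\|");
        float[] v = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Float.parseFloat(parts[i]);
        }
        return v;
    }
}
```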
Then during searching:
// tf-idf part
Query q = new QueryParser(Version.LUCENE_30, "content",
        new StandardAnalyzer(Version.LUCENE_30))
        .parse("some text here");

// value-source part (this is the piece I need help with)
FieldScoreQuery qf = new FieldScoreQuery("lda score",
        FieldScoreQuery.Type.BYTE);

CustomScoreQuery customQ = new CustomScoreQuery(q, qf) {
    @Override
    public CustomScoreProvider getCustomScoreProvider(IndexReader r) {
        return new CustomScoreProvider(r) {
            @Override
            public float customScore(int doc,
                                     float tfidfScore,
                                     float ldaScore) {
                return 0.5f * tfidfScore + 0.5f * ldaScore;
            }
        };
    }
};
Obviously, it is the FieldScoreQuery portion that I need help with. How do I read in the query string, run LDA inference (the analysis is separate from Lucene), and compute cosine similarity to churn out scores for the CustomScoreQuery to consume?
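For what it's worth, the combination I'm after is Lucene-independent once both topic vectors are in hand; a sketch with the 0.5/0.5 weights hard-coded and names of my own choosing:

```java
public class LdaBlend {
    // Cosine similarity between the query's and a document's topic vectors.
    public static float cosine(float[] q, float[] d) {
        float dot = 0f, nq = 0f, nd = 0f;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            nq  += q[i] * q[i];
            nd  += d[i] * d[i];
        }
        if (nq == 0f || nd == 0f) return 0f;  // guard against zero vectors
        return dot / (float) (Math.sqrt(nq) * Math.sqrt(nd));
    }

    // The planned mixture: 0.5 * tfidf + 0.5 * cos(Q, D).
    public static float score(float tfidf, float[] q, float[] d) {
        return 0.5f * tfidf + 0.5f * cosine(q, d);
    }
}
```

The open question is how to get the per-document vector and the query's inferred vector into such a method from inside customScore.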
Is this the correct way to do this, or do I need to go into the Similarity classes? Some code samples to help me get started would be appreciated.