7
votes

I am experimenting with Solr's MoreLikeThis feature.

My schema deals with articles, and I'm looking for similarities between articles within three fields: articletitle, articletext and topic.

The following query works well:

q=id:(2e2ec74c-7c26-49c9-b359-31a11ea50453)
&rows=100000000&mlt=true
&mlt.fl=articletext,articletitle,topic&mlt.boost=true&mlt.mindf=1&mlt.mintf=1

But I would like to experiment with boosting different query fields - i.e. putting more weight on similarities in the articletitle, for instance.

The documentation (http://wiki.apache.org/solr/MoreLikeThis) suggests that this can be achieved by including the mlt.qf property, with some boosting.

My attempt at such a query is as follows:

q=id:(2e2ec74c-7c26-49c9-b359-31a11ea50453)&rows=100000000&mlt=true
&mlt.fl=articletext,articletitle,topic&mlt.boost=true
&mlt.mindf=1&mlt.mintf=1
&mlt.qf=articletext^0.1 articletitle^100 topic^0.1

However, the boosts seem to have no affect - no matter what boosts I supply, the recommendations remain the same (I would except the above query to heavily favour similarities in the titles, but this doesn't seem to be happening)

I can't find any examples in the documentation that use MoreLikeThis in this way, which leads me to believe I've got something wrong.

Has anyone managed to achieve something like this?

2
I have just started running into this problem also. You can put the mlt.interestingTerms=details param in there too to see the terms it selects based on the original document, and they don't seem to change, no matter what I put in the mlt.fl parameter (or what fields I leave out).Simon

2 Answers

3
votes

The MLT component is useful if you have simple recommendation requirements where you have only one field to match on, or several of equal importance. But any time you want to vary the relative importances of the different fields, or need to do something more specific like include an inverse distance boost, then you will probably want to write your own pseudo MLT handler. All the MLT handler does is to generate the top terms from the fields specified based on their tf.idf scores from the source document. You can easily emulate that functionality in some code that generates a custom SOLR OR query. You will lose the advantage of the termvectors, but so long as your queries are reasonably sized (say < 20 terms) it will probably perform pretty well. We have a small index and so generate our own MLT queries with several hundred terms and it executes in an acceptable amount of time (a few ms). However, I have seen this behavior deteriorate somewhat on large indexes with a few 100 million documents and larger fields, and in those cases you need to restrict your query to a small number of top terms. Using your own code in place of MLT is more work, but you gain a lot more in flexibility.

1
votes

even iam facing with same problem in my case i have to find the similarity between documents using more like this on two fields 1) description and 2) columnValue where columnValue where columnValue is given more weightage than description in the final score. Since solr support only string type similarity match and it doesnot support for type double, so i have converted the columnValue field to string type.(i.e the double values are now of type string ex: 231.0 is now "231.0"). here is the schema:

enter image description here schema.xml

and the query which i am using

http://hostname:8983/solr/collection3/mlt?q= &wt=xml&indent=true&mlt=true&mlt.fl=description,columnValue &fq=versionId:1068383519&mlt.count=4000&mlt.mindf=1&mlt.mintf=1 &fl=tableId,tableIndex,score,versionId,columnId,description,columnValue, refVersionId &mlt.qf=description^0.4+columnValue^0.6

here "id" is composite key on refVersionId,VersionId,TableId,TableIndex,ColumnId

but problem is that columnValue boosting is not effective and i dont find any change in the response even if i remove the columnValue from mlt.fl and mlt.qf, columnValue is not participating in the similarity match. According to me the mlt is working only on single field i.e description. Do u have any suggestion or any solution to fix this issue.