I am evaluating a recommendation engine using precision and recall. So far, I have evaluated the system on 4 different datasets. The precision values are 0.833, 0.857, 0.857 and 0.769, and the recall values for the same datasets are 0.448, 0.875, 0.5504 and 0.512 respectively. How can I use these results to evaluate the recommendation engine under test? Should I run standard collaborative filtering (CF) on the same datasets and compare the values, or is there a standard benchmark of precision and recall for classifying a recommendation system? For example, is there a rule that if precision is x and recall is y, the algorithm should be discarded or accepted?
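
To make the four precision/recall pairs easier to compare, here is a minimal sketch that folds each pair into a single F1 score (the harmonic mean of precision and recall). Choosing F1 as the summary metric is an assumption for illustration, not something the system under test prescribes, and the names `f1_score` and `results` are hypothetical.

```python
# Sketch: combine each (precision, recall) pair into an F1 score.
# Assumption: F1 is used only as an illustrative summary metric.

def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per dataset, in the order given in the question.
results = [
    (0.833, 0.448),
    (0.857, 0.875),
    (0.857, 0.5504),
    (0.769, 0.512),
]

f1_values = [f1_score(p, r) for p, r in results]

for i, ((p, r), f1) in enumerate(zip(results, f1_values), start=1):
    print(f"dataset {i}: precision={p:.3f} recall={r:.4f} F1={f1:.3f}")
```

A single number per dataset would still need a reference point, such as the same score computed for a baseline on the same data, before it says anything about accepting or discarding the algorithm.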