Well, in the Information Retrieval context items are handled in a boolean manner, i.e., they are either relevant or non-relevant. Mahout's GenericRecommenderIRStatsEvaluator uses a data splitter to build a set of already preferred (or, in your case, bought) items, which represents the relevant items. In Mahout's case the selected items are the top-n most-preferred items, so with boolean ratings it simply selects n preferred items. I don't believe this makes the evaluation itself drastically less accurate than with regular five-star ratings, since buying is a pretty strong sign of preference. So:
1) If you have managed to make recommendations, then you can evaluate them using precision and recall as metrics.
2) I have used a random recommender as a benchmark (just an implementation of a Mahout recommender that selects n random items). It usually produces pretty low precision and recall, so if an algorithm scores lower than the random recommender, it should probably be ditched. Another metric I would look at in the offline evaluation phase is reach, since a recommender that produces recommendations for only 80 out of 6000 active users is pretty useless.
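To make the boolean precision/recall evaluation concrete, here is a minimal sketch in plain Java (no Mahout dependency; the class and method names are just illustrative, not Mahout's API). It shows the two steps described above: picking a user's top-n preferred items as the "relevant" set, the way the data splitter roughly does, and then scoring a recommendation list against that set:

```java
import java.util.*;
import java.util.stream.Collectors;

public class IRStatsSketch {
    // Pick the n most-preferred items as the "relevant" set, roughly what
    // the data splitter does. With boolean preferences every rating is
    // equal, so this amounts to selecting any n preferred items.
    static Set<String> relevantSet(Map<String, Double> prefs, int n) {
        return prefs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // precision@n: fraction of recommended items that are relevant.
    static double precision(List<String> recommended, Set<String> relevant) {
        if (recommended.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / recommended.size();
    }

    // recall@n: fraction of relevant items that were recommended.
    static double recall(List<String> recommended, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }
}
```

For example, if the relevant set is {a, c, d, e} and the recommender returns [a, b, c], precision is 2/3 and recall is 2/4.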
Also, it should be noted that in academic papers precision and recall have been criticized when used as the sole metrics. In the end the user decides what is relevant and what is not, and a recommender that scores slightly lower than another is not necessarily worse. For example, more novel or serendipitous recommendations may lower precision and recall.
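The random-recommender baseline and the reach metric from point 2 can also be sketched in a few lines of plain Java (again, illustrative names and a stand-in implementation, not Mahout's classes):

```java
import java.util.*;

public class RandomBaselineSketch {
    // A random recommender baseline: pick n items the user has not yet
    // seen, uniformly at random. Beating this is the bare minimum for
    // any real recommender.
    static List<String> recommend(Set<String> allItems, Set<String> seen,
                                  int n, Random rng) {
        List<String> candidates = new ArrayList<>(allItems);
        candidates.removeAll(seen);
        Collections.shuffle(candidates, rng);
        return candidates.subList(0, Math.min(n, candidates.size()));
    }

    // Reach: share of active users who receive at least one recommendation.
    static double reach(int usersWithRecommendations, int activeUsers) {
        return (double) usersWithRecommendations / activeUsers;
    }
}
```

With 80 users recommended out of 6000 active ones, reach is about 0.013, which is the kind of number that should disqualify a recommender regardless of its precision.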