3
votes

I am a server engineer at a company that provides a dating service. I am currently building a PoC for our new recommendation engine using neo4j, but the database's performance does not meet our needs. I have a strong feeling that I am doing something wrong and that neo4j can do much better. Can someone advise me on how to improve the performance of my Cypher query, or how to tune neo4j properly? I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux. In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH. Each user also has properties such as countryCode, birthday and gender.

I imported all our users and relationships from an RDBMS into neo4j using the neo4j-import tool. Each user is a node with properties, and each reference is a relationship.

The report from the neo4j-import tool said that the following were imported:

2 558 667 nodes
1 674 714 539 properties
1 664 532 288 relationships

So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.

I created 3 indexes in neo4j:

Indexes
ON :User(userId)           ONLINE  
ON :User(countryCode)      ONLINE  
ON :User(birthday)         ONLINE  

Then I tried to build an online recommendation engine using this query:

MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE |  :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC 
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
      recommendation.birthday <= {recommendation_age_lte} AND
      recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC 
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation

Here is the execution plan for one of the users: plan

When I executed this query for a list of users, I got this result:

count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds

So even the fastest run (about 4.6 seconds) is too slow for real-time recommendations.

Can you tell me what I am doing wrong?

Thanks.

EDIT 1: plan with the boxes expanded: plan

3
Can you upload the plan with the boxes expanded? - Brian Underwood
Uploaded the expanded plan. - Mike
Hey Mike, can you drop me an email, michael at neo4j.com? I would love to get access to your database to help you with your query. - Michael Hunger

3 Answers

2
votes

I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna

This is a first shot; there are several things we can do to speed things up. We can pre-calculate and save similar users, cache things here and there, and apply various other tricks. Give it a shot and let us know how it goes.
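
To illustrate the pre-calculation idea in plain Cypher, here is a minimal sketch that stores the top similar users for one user as relationships, so that the online query could start from them instead of recomputing the overlap each time. The SIMILAR relationship type, its weight property, and the limit of 100 are hypothetical names and values chosen for illustration; they are not in the original data model, and this is not necessarily how the extension above works.

```cypher
// Sketch only: SIMILAR, s.weight and LIMIT 100 are assumptions, not part of the original model.
// Parameter syntax {source_user_id} matches neo4j 2.3.
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar)
WHERE similar <> me
// count(*) is the number of shared LIKE/MATCH targets between me and similar
WITH me, similar, count(*) AS weight
ORDER BY weight DESC
LIMIT 100
// Store the result so the online query can read it directly
MERGE (me)-[s:SIMILAR]->(similar)
SET s.weight = weight
```

A job like this would presumably run offline, one user at a time, and would need to be refreshed as new LIKE and MATCH relationships arrive.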

Regards, Max

0
votes

If I'm reading the plan right, the query finds all users matching the userId and, separately, all users matching your various criteria. It then finds all of the places where the two sets come together.

Since you're starting on the left with a single node, my guess is that we'd be better served by following the paths first and then filtering what was reached via relationship traversal.

Let's see how starting like this works for you:

MATCH
  (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
  <-[:LIKE |  :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC 
SKIP {skip_similar_person} LIMIT {limit_similar_person}



MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
      recommendation.birthday <= {recommendation_age_lte} AND
      recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC 
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
0
votes

[UPDATED]

One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.

Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.

Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.

The same considerations also apply to the recommendation node.

Of course, this all has to be verified by testing on your actual data.
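
As a rough sketch of that change, here is the first part of the query with the :User label removed from similar and the USING INDEX similar:User(birthday) hint dropped, prefixed with PROFILE so the resulting plan and db hits can be compared against the original. It uses only the parameters from the question; whether it is actually faster depends on your data.

```cypher
// Sketch: expand from the single "me" node first, then filter "similar" by property.
// No label on similar, so the planner has no :User(birthday) index to seek on for it.
PROFILE
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar)
USING INDEX me:User(userId)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
RETURN similar, count(*) AS weight
ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
```

If the profile shows the plan starting with a seek on the userId index followed by expands and a filter, rather than a seek on the birthday index, the same treatment can be tried on the recommendation part of the query.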