Optimize a query that computes Jaccard similarity between two nodes in huge data sets

Question

I have billions of nodes labelled Profile (:Profile {member_id, name, gender}). I need to compute the Jaccard index between pairs of them, create a SIMILARITY relationship between each pair, and store the index as a property on it. There are CONTACTED relationships from male Profile nodes to female Profile nodes and vice versa.

Below is the Cypher query:

The gender property is indexed.

MATCH (u1:Profile {gender:"Male"}), (u2:Profile {gender:"Male"}) WHERE u1 <> u2
MATCH (u1)-[:CONTACTED]->(u3:Profile {gender:"Female"})<-[:CONTACTED]-(u2) WITH u1, u2, count(u3.member) AS intersect
MATCH (u1)-[:CONTACTED]->(u1_f:Profile {gender:"Female"}) WITH u1, u2, intersect, collect(DISTINCT u1_f.member) AS coll1
MATCH (u2)-[:CONTACTED]->(u2_f:Profile {gender:"Female"}) WITH u1, u2, collect(DISTINCT u2_f.member) AS coll2, coll1, intersect
WITH u1, u2, intersect, coll1, coll2, length(coll1 + filter(x IN coll2 WHERE NOT x IN coll1)) AS union
WHERE (1.0*intersect/union) > 0
CREATE UNIQUE (u1)-[:SIMILARITY {score: (1.0*intersect/union)}]-(u2);

If I execute this with a limit of 5, it takes approximately 5 minutes to return results, which is not feasible. What can I do to speed up execution? This is an important part of my project.

I thought something like the query below would work, but it made things worse.

I created a constraint on member.

LOAD CSV WITH HEADERS FROM "file:{path_to_csv}/member_to_member.csv" AS row
MATCH (u1:Profile {member: row.sentby}), (u2:Profile {gender:"Male"}) WHERE u1 <> u2 AND row.status = "Contacted" AND row.sentbygender = "Male"
MATCH (u1)-[:CONTACTED]->(u3:Profile {member: row.recdby})<-[:CONTACTED]-(u2) WITH row, u1, u2, count(u3.member) AS intersect
//WHERE intersect>0
MATCH (u1)-[:CONTACTED]->(u1_f:Profile {member: row.recdby}) WITH row, u1, u2, intersect, collect(DISTINCT u1_f.member) AS coll1
MATCH (u2)-[:CONTACTED]->(u2_f:Profile {member: row.recdby}) WITH row, u1, u2, collect(DISTINCT u2_f.member) AS coll2, coll1, intersect
WITH u1, u2, intersect, coll1, coll2, length(coll1 + filter(x IN coll2 WHERE NOT x IN coll1)) AS union
RETURN u1.member, u2.member, (1.0*intersect/union) AS score LIMIT 5;

member_to_member.csv

sentby,sentbygender,recdby,recdbygender,date_of_contact,status
OSH34878034,Male,angella,Female,2013-11-12,Contacted
OSH34878034,Male,AnshuSharma,Female,2013-11-12,Contacted
OSH34878034,Male,GSH26933499,Female,2013-11-12,Contacted
OSH34878034,Male,4SH00112696,Female,2013-11-12,Contacted
OSH34878034,Male,0308heinz,Female,2013-11-12,Contacted
OSH34878034,Male,8SH93301323,Female,2013-11-12,Contacted
OSH34878034,Male,098w,Female,2013-11-12,Contacted

Source: http://www.lyonwj.com/twizzard-a-tweet-recommender-system-using-neo4j/

Note: The above query only finds Male -> Male similarity.

Thanks

Which Neo4j version are you using? Can you try it on 2.2 with profiling and share the profile results? — Michael Hunger
I'm using Neo4j enterprise 2.1.7. I can share the development CSVs with you. — sudarshan poojary

Michael Hunger · Accepted Answer · 2015-03-20T11:07:28

Use Java instead of Cypher.

Your first line already creates 1bn squared rows.
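
As a rough illustration of that scale (assuming N male profiles, with N = 10^9 taken from the "1bn" figure above; the real count is not given in the question), the first MATCH pairs every male profile with every other one:

```latex
\[
  \text{rows} = N(N-1) \approx N^2,
  \qquad
  N = 10^{9} \;\Rightarrow\; \text{rows} \approx 10^{18}
\]
```

Every later MATCH in the query then runs once per row, so the cost is dominated by this cross product before any CONTACTED relationship is even examined.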

I'd probably create one big long array covering all the profiles, indexed by node id.

I'd fold the 3 counters into one long (as a bitmask).

Then go over all the CONTACTED relationships for females. (I would also promote the gender property to a label.)

For each relationship, find the matching end-node entry in the array by node id and increment one of the 3 counters.

That should give you an array with the raw numbers, and you can compute the results from that.
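
A minimal sketch of the packing idea, in plain Java with no Neo4j API calls. The answer does not say what the three counters hold or how wide each field is, so the 21-bit field width, the slot numbers, the class name PackedCounters, and the in-memory relationship list are all assumptions made for illustration. The jaccard helper uses the same ratio as the query in the question, with the union derived from the two set sizes and the intersection.

```java
class PackedCounters {
    // Assumption: three 21-bit counters per long (3 * 21 = 63 bits, top bit unused).
    static final int BITS = 21;
    static final long MASK = (1L << BITS) - 1;

    // Read counter `slot` (0, 1 or 2) out of a packed value.
    public static long get(long packed, int slot) {
        return (packed >>> (slot * BITS)) & MASK;
    }

    // Return the packed value with counter `slot` incremented by one.
    public static long increment(long packed, int slot) {
        if (get(packed, slot) == MASK) {
            throw new ArithmeticException("counter " + slot + " would overflow");
        }
        return packed + (1L << (slot * BITS));
    }

    // Jaccard index from raw counts: intersect / (sizeA + sizeB - intersect).
    public static double jaccard(long sizeA, long sizeB, long intersect) {
        long union = sizeA + sizeB - intersect;
        return union == 0 ? 0.0 : (double) intersect / union;
    }

    public static void main(String[] args) {
        // One packed entry per node id (hypothetical ids 0..3).
        long[] counters = new long[4];

        // Hypothetical relationship stream: {endNodeId, slotToIncrement}.
        // In a real run these would come from iterating CONTACTED relationships.
        int[][] rels = {{2, 0}, {2, 0}, {2, 1}, {3, 2}};
        for (int[] rel : rels) {
            counters[rel[0]] = increment(counters[rel[0]], rel[1]);
        }

        for (int id = 0; id < counters.length; id++) {
            System.out.println("node " + id + ": "
                + get(counters[id], 0) + ", "
                + get(counters[id], 1) + ", "
                + get(counters[id], 2));
        }
        System.out.println("jaccard(3, 4, 2) = " + jaccard(3, 4, 2));
    }
}
```

With 21-bit fields each counter tops out at 2,097,151. If a profile can have more contacts than that, the three counters no longer fit in one long and would need wider storage, for example two longs per node.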
