I'm very new to Neo4j/Graph databases and trying to replicate tutorial from Cypher cookbook: http://docs.neo4j.org/chunked/stable/cypher-cookbook-similarity-calc.html
Random data set with 100 foods and 1500 persons, all persons are related to food through ATE relationship with "times" integer property. Food and Person are labeled and have properties "name" - which is indexed by auto-index
neo4j-sh (?)$ dbinfo -g "Primitive count"
{
"NumberOfNodeIdsInUse": 1600,
"NumberOfPropertyIdsInUse": 151600,
"NumberOfRelationshipIdsInUse": 150000,
"NumberOfRelationshipTypeIdsInUse": 1
}
neo4j-sh (?)$ index --indexes
Node indexes:
node_auto_index
Relationship indexes:
relationship_auto_index
Running modified query from cookbook in neo4j-shell never completes (probably because of too much nodes/relationships?):
EXPORT name="Florida Goyette"
MATCH (me:Person { name: {name}})-[r1:ATE]->(food)<-[r2:ATE]-(you:Person)
WITH me,count(DISTINCT r1) AS H1,count(DISTINCT r2) AS H2,you
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
RETURN SUM((1-ABS(r1.times/H1-r2.times/H2))*(r1.times+r2.times)/(H1+H2)) AS similarity
LIMIT 100;
So I started to look how can I limit earlier to "first" 100 persons and came out with this:
EXPORT name="Florida Goyette"
MATCH (me:Person { name: {name} })-[r1:ATE]->(food)
WITH me, food
MATCH (food)<-[r2:ATE]-(you)
WHERE me <> you
WITH me, you
LIMIT 100
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
WITH me, count(DISTINCT r1) AS H1, count(DISTINCT r2) AS H2, you
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
WITH me, you, SUM((1-ABS(r1.times/H1-r2.times/H2))*(r1.times+r2.times)/(H1+H2)) AS similarity
RETURN me.name, you.name, similarity
ORDER BY similarity DESC;
But this query performs very poorly on warmed up cache
100 rows
16038 ms
Is there any chance to make such query to perform faster, for "real-time" usage?
System and Neo4j
Windows 7 (64-bit), Intel Core I7-2600K, 8GB RAM, Neo4j database on SSD drive.
Neo4j Community version: 2.1.0-M01 (also tested on 2.0.1 stable)
neo4j-community.options
-Xmx2048m
-Xms2048m
neo4j.properties
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=330M
neostore.propertystore.db.arrays.mapped_memory=330M
node_auto_indexing=true
node_keys_indexable=name
relationship_auto_indexing=true
relationship_keys_indexable=times
Cypher dump of my data (503kb zipped)
PROFILE output
ColumnFilter(symKeys=["similarity", "you", "you.name", "me", "me.name"], returnItemNames=["me.name", "you.name", "similarity"], _rows=100, _db_hits=0)
Sort(descr=["SortItem(similarity,false)"], _rows=100, _db_hits=0)
Extract(symKeys=["me", "you", "similarity"], exprKeys=["me.name", "you.name"], _rows=100, _db_hits=200)
ColumnFilter(symKeys=["me", "you", " INTERNAL_AGGREGATEcb085cf5-8982-4a83-ba3d-9642de570c59"], returnItemNames=["me", "you", "similarity"], _rows=100, _db_hits=0)
EagerAggregation(keys=["me", "you"], aggregates=["(INTERNAL_AGGREGATEcb085cf5-8982-4a83-ba3d-9642de570c59,Sum(Divide(Multiply(Subtract(Literal(1),AbsFunction(Subtract(Divide(Property(r1,times(1)),H1),Divide(Property(r2,times(1)),H2)))),Add(Property(r1,times(1)),Property(r2,times(1)))),Add(H1,H2))))"], _rows=100, _db_hits=40000)
SimplePatternMatcher(g="(you)-['r2']-(food),(me)-['r1']-(food)", _rows=10000, _db_hits=0)
ColumnFilter(symKeys=["me", "you", " INTERNAL_AGGREGATE677cd11c-ae53-4d7b-8df6-732ffed28bbf", " INTERNAL_AGGREGATEb5eb877c-de01-4e7a-9596-03cd94cfa47a"], returnItemNames=["me", "H1", "H2", "you"], _rows=100, _db_hits=0)
EagerAggregation(keys=["me", "you"], aggregates=["( INTERNAL_AGGREGATE677cd11c-ae53-4d7b-8df6-732ffed28bbf,Distinct(Count(r1),r1))", "( INTERNAL_AGGREGATEb5eb877c-de01-4e7a-9596-03cd94cfa47a,Distinct(Count(r2),r2))"], _rows=100, _db_hits=0)
SimplePatternMatcher(g="(you)-['r2']-(food),(me)-['r1']-(food)", _rows=10000, _db_hits=0)
ColumnFilter(symKeys=["me", "food", "you", "r2"], returnItemNames=["me", "you"], _rows=100, _db_hits=0)
Slice(limit="Literal(100)", _rows=100, _db_hits=0)
Filter(pred="NOT(me == you)", _rows=100, _db_hits=0)
SimplePatternMatcher(g="(you)-['r2']-(food)", _rows=100, _db_hits=0)
ColumnFilter(symKeys=["food", "me", "r1"], returnItemNames=["me", "food"], _rows=1, _db_hits=0)
Filter(pred="Property(me,name(0)) == {name}", _rows=1,_db_hits=148901)
TraversalMatcher(start={"label": "Person", "producer": "NodeByLabel", "identifiers": ["me"]}, trail="(me)-[r1:ATE WHERE true AND true]->(food)", _rows=148901, _db_hits=148901)