Neo4j Cypher query for similarity calculation slow performance

Question

I'm very new to Neo4j/Graph databases and trying to replicate tutorial from Cypher cookbook: http://docs.neo4j.org/chunked/stable/cypher-cookbook-similarity-calc.html

Random data set with 100 foods and 1500 persons, all persons are related to food through ATE relationship with "times" integer property. Food and Person are labeled and have properties "name" - which is indexed by auto-index

neo4j-sh (?)$ dbinfo -g "Primitive count"
{
  "NumberOfNodeIdsInUse": 1600,
  "NumberOfPropertyIdsInUse": 151600,
  "NumberOfRelationshipIdsInUse": 150000,
  "NumberOfRelationshipTypeIdsInUse": 1
}

neo4j-sh (?)$ index --indexes
Node indexes:
  node_auto_index
Relationship indexes:
  relationship_auto_index

Running modified query from cookbook in neo4j-shell never completes (probably because of too much nodes/relationships?):

EXPORT name="Florida Goyette"
MATCH (me:Person { name: {name}})-[r1:ATE]->(food)<-[r2:ATE]-(you:Person)
WITH me,count(DISTINCT r1) AS H1,count(DISTINCT r2) AS H2,you
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
RETURN SUM((1-ABS(r1.times/H1-r2.times/H2))*(r1.times+r2.times)/(H1+H2)) AS similarity
LIMIT 100;

So I started to look how can I limit earlier to "first" 100 persons and came out with this:

EXPORT name="Florida Goyette"
MATCH (me:Person { name: {name} })-[r1:ATE]->(food)
WITH me, food
MATCH (food)<-[r2:ATE]-(you)
WHERE me <> you
WITH me, you
LIMIT 100
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
WITH me, count(DISTINCT r1) AS H1, count(DISTINCT r2) AS H2, you
MATCH (me)-[r1:ATE]->(food)<-[r2:ATE]-(you)
WITH me, you, SUM((1-ABS(r1.times/H1-r2.times/H2))*(r1.times+r2.times)/(H1+H2)) AS similarity
RETURN me.name, you.name, similarity
ORDER BY similarity DESC;

But this query performs very poorly on warmed up cache

100 rows
16038 ms

Is there any chance to make such query to perform faster, for "real-time" usage?

System and Neo4j

Windows 7 (64-bit), Intel Core I7-2600K, 8GB RAM, Neo4j database on SSD drive.

Neo4j Community version: 2.1.0-M01 (also tested on 2.0.1 stable)

neo4j-community.options

 -Xmx2048m
 -Xms2048m

neo4j.properties

neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=330M
neostore.propertystore.db.arrays.mapped_memory=330M

node_auto_indexing=true
node_keys_indexable=name
relationship_auto_indexing=true
relationship_keys_indexable=times

Cypher dump of my data (503kb zipped)

PROFILE output

ColumnFilter(symKeys=["similarity", "you", "you.name", "me", "me.name"], returnItemNames=["me.name", "you.name", "similarity"], _rows=100, _db_hits=0)
Sort(descr=["SortItem(similarity,false)"], _rows=100, _db_hits=0)
  Extract(symKeys=["me", "you", "similarity"], exprKeys=["me.name", "you.name"], _rows=100, _db_hits=200)
    ColumnFilter(symKeys=["me", "you", "  INTERNAL_AGGREGATEcb085cf5-8982-4a83-ba3d-9642de570c59"], returnItemNames=["me", "you", "similarity"], _rows=100, _db_hits=0)
      EagerAggregation(keys=["me", "you"], aggregates=["(INTERNAL_AGGREGATEcb085cf5-8982-4a83-ba3d-9642de570c59,Sum(Divide(Multiply(Subtract(Literal(1),AbsFunction(Subtract(Divide(Property(r1,times(1)),H1),Divide(Property(r2,times(1)),H2)))),Add(Property(r1,times(1)),Property(r2,times(1)))),Add(H1,H2))))"], _rows=100, _db_hits=40000)
        SimplePatternMatcher(g="(you)-['r2']-(food),(me)-['r1']-(food)", _rows=10000, _db_hits=0)
          ColumnFilter(symKeys=["me", "you", "  INTERNAL_AGGREGATE677cd11c-ae53-4d7b-8df6-732ffed28bbf", "  INTERNAL_AGGREGATEb5eb877c-de01-4e7a-9596-03cd94cfa47a"], returnItemNames=["me", "H1", "H2", "you"], _rows=100, _db_hits=0)
            EagerAggregation(keys=["me", "you"], aggregates=["(  INTERNAL_AGGREGATE677cd11c-ae53-4d7b-8df6-732ffed28bbf,Distinct(Count(r1),r1))", "(  INTERNAL_AGGREGATEb5eb877c-de01-4e7a-9596-03cd94cfa47a,Distinct(Count(r2),r2))"], _rows=100, _db_hits=0)
              SimplePatternMatcher(g="(you)-['r2']-(food),(me)-['r1']-(food)", _rows=10000, _db_hits=0)
                ColumnFilter(symKeys=["me", "food", "you", "r2"], returnItemNames=["me", "you"], _rows=100, _db_hits=0)
                  Slice(limit="Literal(100)", _rows=100, _db_hits=0)
                    Filter(pred="NOT(me == you)", _rows=100, _db_hits=0)
                      SimplePatternMatcher(g="(you)-['r2']-(food)", _rows=100, _db_hits=0)
                        ColumnFilter(symKeys=["food", "me", "r1"], returnItemNames=["me", "food"], _rows=1, _db_hits=0)
                          Filter(pred="Property(me,name(0)) == {name}", _rows=1,_db_hits=148901)
                            TraversalMatcher(start={"label": "Person", "producer": "NodeByLabel", "identifiers": ["me"]}, trail="(me)-[r1:ATE WHERE true AND true]->(food)", _rows=148901, _db_hits=148901)

cybersam cybersam · Accepted Answer · 2014-04-03T19:46:20

You are doing the same MATCHing multiple times. Does this work better?

EXPORT name="Florida Goyette"
MATCH (me:Person { name: {name}})-[r1:ATE]->(food)<-[r2:ATE]-(you:Person)
WITH me,r1,r2,count(DISTINCT r1) AS H1,count(DISTINCT r2) AS H2,you
LIMIT 100 
RETURN SUM((1-ABS(r1.times/H1-r2.times/H2))*(r1.times+r2.times)/(H1+H2)) AS similarity;

Neo4j Cypher query for similarity calculation slow performance

3 Answers