I'm using Cypher over the batch REST API with Neo4j 2.0.1.
I'm trying to optimize my queries which have lots of optional relationships. I'd like to retrieve all of the data in one shot to limit the number of round trips I have to make to the database. Even though I only have about 12000 nodes in my database, the queries are already starting to crawl (some are taking over 1.5 seconds to return 1000 nodes).
I've set up a graph gist that goes into more detail at http://gist.neo4j.org/?9494429e3cbbbeda2b11.
My queries all generally take the following form:
MATCH (u:user { id: "u1" })
WITH u
MATCH u-[:CONTACT]->(c:contact)
WITH u, c
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
COLLECT(DISTINCT xp) AS xps,
COLLECT(DISTINCT xa) AS xas,
COLLECT(DISTINCT xup) AS xups,
COLLECT(DISTINCT xua) AS xuas,
xca.createdat AS createdat,
xcu.createdat AS updatedat,
{id: xcc.id} AS createdby,
{id: xuc.id} AS updatedby
RETURN COLLECT({
id: x.id,
name: COALESCE(u.name, x.name),
createdat: createdat,
createdby: createdby,
updatedat: updatedat,
updatedby: updatedby,
phones: (CASE WHEN size(xps)= 0
THEN NULL
ELSE [xp IN xps | { id: xp.id, number: xp.number}]
END),
userphones: (CASE WHEN size(xups)= 0
THEN NULL
ELSE [xup IN xups | { id: xup.id, number: xup.number }]
END),
addresses: (CASE WHEN size(xas)= 0
THEN NULL
ELSE [xa IN xas | { id: xa.id, city: xa.city}]
END),
useraddresses: (CASE WHEN size(xuas)= 0
THEN NULL
ELSE [xua IN xuas | { id: xua.id, city: xua.city}]
END)
}) AS r
Is there a better way to query for nodes that have lots of optional relationships? What type of performance should I be expecting in cases like these? Thanks!
UPDATE
I've rewritten the query based on Michael's suggestion and verified that each step has a cardinality of 1 (that is, only one row is returned for each step of the query). When I return just a single contact the query takes about 400ms (about 5x worse than the original query).
However, when I try and run the query against my larger dataset that returns 1000 contacts it now just hangs, never completes, and I have to restart the Neo4j server. I didn't try and combine the results into a new map, but I don't think that is going to solve the problem. Am I creating a worse cross product now that doesn't show up when I test the query step by step?
MATCH (u:user { id: "123" })
WITH u
MATCH (u)-[:CONTACT]->(c:contact)
WITH c
OPTIONAL MATCH
(c)-[:CREATED]->(xca:activity)-[:USERACTIVITY*1..4]<-(xcc:contact),
(c)-[:HISTORY]->(xcu:activity)-[:USERACTIVITY*1..4]<-(xuc:contact)
WITH c AS x,
xca.createdat AS createdat, xcu.createdat AS updatedat,
{id: xcc.id, object: xcc.object} AS createdby,
{id: xuc.id, object: xuc.object} AS updatedby
OPTIONAL MATCH
(x)-[:PHONE]->(xp:phone)
WITH x, createdat, updatedat, createdby, updatedby,
COLLECT(xp) as xps
OPTIONAL MATCH
(x)-[:ADDRESS]->(xa:address)
WITH x, createdat, updatedat, createdby, updatedby, xps,
COLLECT(xa) as xas
OPTIONAL MATCH (xu:user)-[:CONTACT]->(x)
OPTIONAL MATCH (xu)-[:PHONE]->(xup:phone)
WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
xu, COLLECT(xup) as xups
OPTIONAL MATCH (xu)-[:ADDRESS]->(xua:address)
WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
xu, xups, COLLECT(xua) as xuas
RETURN COLLECT({
id: x.id,
object: x.object,
status: x.status,
teamid: x.teamid,
name: COALESCE(xu.name, x.name),
displayname: COALESCE(xu.displayname, x.displayname),
email: COALESCE(xu.email, x.email),
imageurl: COALESCE(xu.imageurl, x.imageurl),
workhours: x.workhours,
notes: x.notes,
company: x.company,
createdat: createdat,
createdby: createdby,
updatedat: updatedat,
updatedby: updatedby,
isuser: (NOT xu IS NULL),
phones: (CASE WHEN size(xps)= 0
THEN NULL
ELSE [xp IN xps | { id: xp.id, object: xp.object,
number: xp.number, description: xp.description }]
END),
userphones: (CASE WHEN size(xups)= 0
THEN NULL
ELSE [xup IN xups | { id: xup.id, object: xup.object,
number: xup.number, description: xup.description }]
END),
addresses: (CASE WHEN size(xas)= 0
THEN NULL
ELSE [xa IN xas | { id: xa.id, object: xa.object,
street: xa.street, locality: xa.locality, region: xa.region,
postcode: xa.postcode, country: xa.country, description: xa.description, neighborhood: xa.neighborhood }]
END),
useraddresses: (CASE WHEN size(xuas)= 0
THEN NULL
ELSE [xua IN xuas | { id: xua.id, object: xua.object,
street: xua.street, locality: xua.locality, region: xua.region,
postcode: xua.postcode, country: xua.country, description: xua.description, neighborhood: xua.neighborhood }]
END)
}) AS r
UPDATE 2
I tried moving the phones and addresses around but it had no effect, I even took them out and still saw similar results (over 2s for 1000 contacts). I've eliminated all of the complexity in the query just to see what a baseline would be. The following query takes 385ms on average to execute:
MATCH (t:team {id:"123"})
WITH t
MATCH (c:contact)-[:CONTACT]->(t)
WITH c AS x
RETURN COLLECT({
id: x.id,
object: x.object,
status: x.status,
teamid: x.teamid,
name: x.name,
displayname: x.displayname,
email: x.email,
imageurl: x.imageurl,
workhours: x.workhours,
notes: x.notes,
company: x.company
}) AS r
My database has 6000 nodes and 12000 relationships and this query returns 1000 contacts (the entire db is 7 MB in size). Is almost 400ms for this type of query expected?
I really appreciate the offer to look at my DB, but I guess I really want to know how to diagnose these problems myself. When I use the web UI, I see no explosion (only 1 row per result being returned). When I use the PROFILE command, I see no numbers in the millions like you expect.
Are there other tools available to diagnose performance issues? Is there a debugger of some kind to track down the problem?