I'm using Cypher over the batch REST API with Neo4j 2.0.1.

I'm trying to optimize my queries which have lots of optional relationships. I'd like to retrieve all of the data in one shot to limit the number of round trips I have to make to the database. Even though I only have about 12000 nodes in my database, the queries are already starting to crawl (some are taking over 1.5 seconds to return 1000 nodes).

I've set up a graph gist that goes into more detail at http://gist.neo4j.org/?9494429e3cbbbeda2b11.

My queries all generally take the following form:

MATCH (u:user { id: "u1" })
WITH u

MATCH (u)-[:CONTACT]->(c:contact)
WITH u, c

OPTIONAL MATCH  (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH  (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
    COLLECT(DISTINCT xp) AS xps,
    COLLECT(DISTINCT xa) AS xas,
    COLLECT(DISTINCT xup) AS xups,
    COLLECT(DISTINCT xua) AS xuas,
    xca.createdat AS createdat,
    xcu.createdat AS updatedat,
    {id: xcc.id} AS createdby,
    {id: xuc.id} AS updatedby
RETURN COLLECT({
    id: x.id,
    name:  COALESCE(u.name, x.name), 
    createdat: createdat,
    createdby: createdby,
    updatedat: updatedat,
    updatedby: updatedby,
        phones: (CASE WHEN size(xps)= 0
          THEN NULL
          ELSE [xp IN xps | { id: xp.id, number: xp.number}]
          END),
        userphones: (CASE WHEN size(xups)= 0
          THEN NULL
          ELSE [xup IN xups | { id: xup.id, number: xup.number }]
          END),
        addresses: (CASE WHEN size(xas)= 0
          THEN NULL
          ELSE [xa IN xas | { id: xa.id, city: xa.city}]
          END),
        useraddresses: (CASE WHEN size(xuas)= 0
          THEN NULL
          ELSE [xua IN xuas | { id: xua.id, city: xua.city}]
          END)
}) AS r

Is there a better way to query for nodes that have lots of optional relationships? What type of performance should I be expecting in cases like these? Thanks!

UPDATE

I've rewritten the query based on Michael's suggestion and verified that each step has a cardinality of 1 (that is, only one row is returned at each step of the query; a sketch of how I checked this follows the query below). When I return just a single contact, the query takes about 400 ms (about 5x worse than the original query).

However, when I try to run the query against my larger dataset, which returns 1000 contacts, it now just hangs, never completes, and I have to restart the Neo4j server. I didn't try to combine the results into a new map, but I don't think that would solve the problem. Am I creating a worse cross product that doesn't show up when I test the query step by step?

  MATCH (u:user { id: "123" })
  WITH u

  MATCH (u)-[:CONTACT]->(c:contact)
  WITH c

  OPTIONAL MATCH
    (c)-[:CREATED]->(xca:activity)-[:USERACTIVITY*1..4]-(xcc:contact),
    (c)-[:HISTORY]->(xcu:activity)-[:USERACTIVITY*1..4]-(xuc:contact)
  WITH c AS x,
    xca.createdat AS createdat, xcu.createdat AS updatedat,
    {id: xcc.id, object: xcc.object} AS createdby,
    {id: xuc.id, object: xuc.object} AS updatedby

  OPTIONAL MATCH
    (x)-[:PHONE]->(xp:phone)
  WITH x, createdat, updatedat, createdby, updatedby,
    COLLECT(xp) as xps

  OPTIONAL MATCH  
    (x)-[:ADDRESS]->(xa:address)
  WITH x, createdat, updatedat, createdby, updatedby, xps,
    COLLECT(xa) as xas

  OPTIONAL MATCH (xu:user)-[:CONTACT]->(x)
  OPTIONAL MATCH (xu)-[:PHONE]->(xup:phone)
  WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
    xu, COLLECT(xup) as xups

  OPTIONAL MATCH (xu)-[:ADDRESS]->(xua:address)
  WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
    xu, xups, COLLECT(xua) as xuas

  RETURN COLLECT({
    id: x.id,
    object: x.object,
    status: x.status,
    teamid: x.teamid,
    name:  COALESCE(xu.name, x.name),
    displayname: COALESCE(xu.displayname, x.displayname),
    email: COALESCE(xu.email, x.email),
    imageurl: COALESCE(xu.imageurl, x.imageurl),
    workhours: x.workhours,
    notes: x.notes,
    company: x.company,
    createdat: createdat,
    createdby: createdby,
    updatedat: updatedat,
    updatedby: updatedby,
    isuser: (NOT xu IS NULL),
    phones: (CASE WHEN size(xps)= 0
      THEN NULL
      ELSE [xp IN xps | { id: xp.id, object: xp.object,
            number: xp.number, description: xp.description }]
      END),
    userphones: (CASE WHEN size(xups)= 0
      THEN NULL
      ELSE [xup IN xups | { id: xup.id, object: xup.object,
            number: xup.number, description: xup.description }]
      END),
    addresses: (CASE WHEN size(xas)= 0
      THEN NULL
      ELSE [xa IN xas | { id: xa.id, object: xa.object,
            street: xa.street, locality: xa.locality, region: xa.region,
            postcode: xa.postcode, country: xa.country, description: xa.description, neighborhood: xa.neighborhood }]
      END),
    useraddresses: (CASE WHEN size(xuas)= 0
      THEN NULL
      ELSE [xua IN xuas | { id: xua.id, object: xua.object,
            street: xua.street, locality: xua.locality, region: xua.region,
            postcode: xua.postcode, country: xua.country, description: xua.description, neighborhood: xua.neighborhood }]
      END)
  }) AS r
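
For reference, the way I've been verifying the cardinality is to cut the query off after each WITH and return a row count instead of the data. A sketch of just the first two stages:

MATCH (u:user { id: "123" })
WITH u
// RETURN count(*) AS rows at this point gives 1

MATCH (u)-[:CONTACT]->(c:contact)
WITH c
RETURN count(*) AS rows
// also 1 when run against the single-contact dataset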

UPDATE 2

I tried moving the phone and address matches around, but it had no effect; I even took them out entirely and still saw similar results (over 2 seconds for 1000 contacts). I've eliminated all of the complexity from the query just to see what a baseline would be. The following query takes 385 ms on average to execute:

MATCH (t:team {id:"123"})
WITH t

MATCH (c:contact)-[:CONTACT]->(t)
WITH c AS x

RETURN COLLECT({
    id: x.id,
    object: x.object,
    status: x.status,
    teamid: x.teamid,
    name:  x.name,
    displayname: x.displayname,
    email: x.email,
    imageurl: x.imageurl,
    workhours: x.workhours,
    notes: x.notes,
    company: x.company
}) AS r

My database has 6000 nodes and 12000 relationships, and this query returns 1000 contacts (the entire DB is 7 MB). Is almost 400 ms for this type of query expected?

I really appreciate the offer to look at my DB, but I guess I really want to know how to diagnose these problems myself. When I step through the query in the web UI, I see no explosion (only 1 row is returned at each step). When I use the PROFILE command, I see no counts in the millions like you'd expect.

Are there other tools available to diagnose performance issues? Is there a debugger of some kind to track down the problem?
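
For what it's worth, the only thing I've found so far is prefixing a query with PROFILE and looking at the rows and db hits reported for each operator, e.g. against the baseline query above:

PROFILE
MATCH (t:team {id:"123"})
WITH t
MATCH (c:contact)-[:CONTACT]->(t)
RETURN count(c) AS contacts

but as I said, nothing in that output looks out of the ordinary to me.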

You didn't tell us that you're doing variable-length matches in between. You also didn't pull the phone + address matches to the beginning. If you have those var-length matches you have to separate them into their own MATCH ... WITH fragment, otherwise this blows up. Try to build the query one step at a time and see where the explosion comes in. – Michael Hunger
Did you see my fork of your GraphGist? – Michael Hunger
I think your updated query has 10x the complexity of the first one, as you go much deeper into the graph and also resolve a lot of additional information. If you can share your db, I'm happy to take a look, but I presume you're touching many, many millions of paths. – Michael Hunger
The reason it wasn't returning was a typo: the direction arrows on the 6th and 7th lines were missing, so those patterns were undirected. Making them directed allows the query to complete, but it's still taking over 2 seconds. I've stepped through each part of the query in the web UI, and at every step only 1 row is returned. Using the PROFILE command shows no signs of path explosion. The activity nodes only have 1 relationship in the direction I'm searching, so (:activity)<-[:USERACTIVITY*1..4]-(:contact) traverses a grand total of 4 nodes. – Bill
I've updated the question and eliminated all of the optional matches from the query, and I still can't get 1000 results in under 350 ms. I also re-stepped through the query from my previous code, and every single WITH statement returns 1 row in the web UI. – Bill

1 Answer

The issue is that you create cross products between all of your matches.
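
To see what I mean: if a contact has, say, 3 phones and 4 addresses, two optional matches in the same fragment already produce 3 x 4 = 12 rows for that contact, and every further match multiplies the row count again. A minimal sketch that shows it:

MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
RETURN c.id, count(*) AS rows
// one row per phone x address combination for each contact,
// which is what every later MATCH and aggregation has to work on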

If you can identify the matches that have at most one connection, you can pull them up front. Otherwise, collect the matched information after each match to get back to a cardinality of 1 (or to the number of contacts, for that matter).

e.g.

MATCH (u:user { id: "u1" })
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
// cardinality 1
WITH u, collect(distinct xup) as phones, collect(distinct xua) as addresses
MATCH (u)-[:CONTACT]->(c:contact)
WITH u, c, phones, addresses
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
WITH u, c, phones, addresses, collect(distinct xcc) as contact_activities
...

You already used map literals, so you can also combine them with what I suggested by incrementally adding keys to a map (or a collection).

e.g.

MATCH (u:user { id: "u1" })
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
// cardinality 1
WITH u, {user:u, phones:collect(distinct xup), addresses: collect(distinct xua)} as user_info
MATCH (u)-[:CONTACT]->(c:contact)
WITH c, user_info
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
WITH c, user_info, {activities: collect(distinct xcc)} as contact_info
...

Also, DISTINCT combined with aggregation will very probably not help you: aggregation already produces one row per distinct grouping key.
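
For example, something like this already returns one row per user, because u is the grouping key; DISTINCT inside collect() is only needed if the pattern can reach the same node via several paths:

MATCH (u:user)-[:PHONE]->(xup:phone)
WITH u, collect(xup) AS phones
// one row per user, no DISTINCT u required
RETURN u.id, phones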

I tried to adapt your graphgist (thanks for providing that, by the way) to show how it could look, though I didn't go all the way through: http://gist.neo4j.org/?bba019835045ed352925

You might also be interested in this graphgist: A complex query result projection in Cypher