3
votes

I'm pretty new to neo4j and I'm having a difficult time optimizing a query which returns a large amount of nodes / relationships.

The following query:

MATCH (u:User)-[:CAN_ADMINISTER]->(cs:CustomerSite)
WHERE u.id="1234" WITH cs
MATCH r1=(cs)<-[:AFFECTS_SITE]-(t:Ticket)
WHERE not(t.status = "COMPLETE")
OPTIONAL MATCH r2=(t)-[:HAS_EVENTS]->(te:TicketEvent)
OPTIONAL MATCH r3=(t)-[:CREATED_BY]->(u:User)
OPTIONAL MATCH r4=(te)<-[:HAS_EVENTS]-(u2:User)
OPTIONAL MATCH r5=(t)-[:AFFECTS_SITE]->(cs)<-[:HAS_SITE]-(c:Customer)
RETURN r1, r2, r3, r4, r5

Takes nearly a minute to run for a user that produces ~7000 rows. I've tried to reorganize it to not much avail. Below is the current profile.

enter image description here

Any suggestions on what might help here?

1
First suggestion: use PROFILE on this query, and in the tree generated by the profile, expand all nodes, and paste that profile into your description. This should be a common starting point for figuring out performance problems. Also, can you confirm that the output of your query is correct, or is it producing anything that's unexpected, either in the data or number of results. Also, I'm assuming that User.id has a unique constraint...is that true?InverseFalcon
The data does seem to be correct and you are correct, User.id is a unique constraint.Greg
I think the next useful thing to figure out if there is any data you're returning which you really don't need, and if we can collect certain results instead of outputting a row per result. For example, it seems like you want to return Users who have TicketEvents associated with the Ticket, and Customers who have sites affected by the Ticket. Do you need to return the associated TicketEvents and sites, or do you just need to know the users and/or customers associated by these things? And can we output these as collections of Users or Customers, or do you need one row for each one?InverseFalcon
I need to return all of the paths so that the data can be associated / displayed in java land. So I want tickets that are related to sites which a user can administer, and all of the relationships of said tickets (ticketEvents, createdBy (User), assignedTo (User) etc. If I return only one path at a time, (i.e. p1, p2, etc) the query isn't all that slow.Greg
The more rows are being generated along the way, the costlier "simple" queries will be, which is why it's valuable to collect results into lists where appropriate to output single rows instead of possibly hundreds, which slows down your other queries. For example, querying for the User who created a Ticket should be quick, as you only have 139 tickets. But you do this optional match AFTER you've emitted the TicketEvents for the Ticket, so the query runs over 726 rows instead of 139, so many of those db hits are redundant since the same ticket is present in many rows.InverseFalcon

1 Answers

4
votes

I highly recommend collecting results from your OPTIONAL MATCHES as needed, and using WITH to break up your queries and narrow down the columns you're interested in to keep your rows down between your subqueries. As explained in the question comments, MATCHES and OPTIONAL MATCHES can build up result rows, which can make queries that SEEM like they should be quick costlier.

For example, I will add comments to analyze the original query inline:

MATCH (u:User)-[:CAN_ADMINISTER]->(cs:CustomerSite)
WHERE u.id="1234" WITH cs
MATCH r1=(cs)<-[:AFFECTS_SITE]-(t:Ticket)
WHERE not(t.status = "COMPLETE")
// we have 1 row per User at a CustomerSite
OPTIONAL MATCH r2=(t)-[:HAS_EVENTS]->(te:TicketEvent)
// now, 1 row per User @ CustomerSite per TicketEvent
OPTIONAL MATCH r3=(t)-[:CREATED_BY]->(u:User)
// the above OPTIONAL MATCH had to iterate over each User/CS/TE row instead of just each distinct TICKET
OPTIONAL MATCH r4=(te)<-[:HAS_EVENTS]-(u2:User)
// now, 1 row per User @ CustomerSite per User on each Ticket Event
OPTIONAL MATCH r5=(t)-[:AFFECTS_SITE]->(cs)<-[:HAS_SITE]-(c:Customer)
// now, 1 row per User @ CustomerSite per User on each Ticket Event per Customer at each Customer Site
RETURN r1, r2, r3, r4, r5

While it changes the format of the returned data, doing COLLECTS along the way, and ordering your OPTIONAL MATCHES better, should improve query speed. Here is one way you could do this:

MATCH (u:User)-[:CAN_ADMINISTER]->(cs:CustomerSite)
WHERE u.id="1234" WITH cs
MATCH (cs)<-[:AFFECTS_SITE]-(t:Ticket)
WHERE not(t.status = "COMPLETE")
// should be 1 creator per ticket, so best to do this first
OPTIONAL MATCH (t)-[:CREATED_BY]->(creator:User)
OPTIONAL MATCH (cs)<-[:HAS_SITE]-(affectedCustomer:Customer)
// collection of affected customers for each ticket (and their creator) affecting a customer site
WITH cs, t, creator, COLLECT(affectedCustomer) as affectedCustomers
OPTIONAL MATCH (t)-[:HAS_EVENTS]->(te:TicketEvent)<-[:HAS_EVENTS]-(userOnEvent:User)
WITH cs, t, creator, affectedCustomers, te, COLLECT(userOnEvent) as usersOnEvent
RETURN cs, t, creator, affectedCustomers, COLLECT({ticketEvent:te, usersOnEvent:usersOnEvent}) as ticketEventsAndUsers

Each row will correspond with a ticket at a customer site, the ticket's creator, the collection of affected customers at the site, and a collection of ticket events for the ticket and for the event, the users on that event.

Give that a try and see how the performance compares. If it looks better, you'll have to change how you parse through the returned data, but it's nothing a little for loop or two can't handle.