neo4j cypher query running very slow

Question

I am using neo4j to store application data and below image depicts the graph structure

Each circle is a node and each arrow depicts a relationship and relation type is defined above. It also defines many to many or one to many or one to one relationship for nodes.

What I would like to retrieve from the graph.

I want to list all the positions for a company and each position will have a array of users and each user will have a array of feedback for each interview round, like below

position ---> candidate1 interview round name (Telephonic) question1 answer1 and answer given by user1 question1 answer1 and answer given by user2 interview round name (HR Round) question1 answer1 and answer given by user1 question1 answer1 and answer given by user2 candidate2 interview round name (Telephonic) question1 answer1 and answer given by user1 question1 answer1 and answer given by user2 interview round name (HR Round) question1 answer1 and answer given by user1 question1 answer1 and answer given by user2 . . .

Many candidates will not have interview rounds than for those candidates the rounds should be null.

Below is the query I am using to retrieve the data I need.

MATCH (comp:Company {dId: "155dyv1wgT"})<-[:`POSITION_COMPANY`]-(pos: Position {status: 'OPEN'})-[:`POSITION_WORKFLOW`]->(:WorkFlow)-[:`WORKFLOW_CANDIDATE-WORKFLOW`]->(cw : CandidateWorkFlow)-[:`CANDIDATE-WORKFLOW_COMPANY-CANDIDATE`]->(cc : CompanyCandidate)

where ((not (has(cc.isSpam) or has(cc.isTrash))) OR (cc.isSpam=false and cc.isTrash=false)) and pos.positionType IN ['PUBLIC','DISCRETE'] with distinct comp, {dId: pos.dId, title: pos.title} as pos, cw, cc

OPTIONAL MATCH (cw)-[:`CANDIDATE_WORKFLOW_INTERVIEW`]->(inwrkflw: InterviewWorkFlow)-[:`INTERVIEW_ROUND`]->(intrnd: InterviewRound)-[:`INTERVIEW_ROUND_FEEDBACK`]->(ffform: FeedbackForm)-[:`FEEDBACK_QUESTION`]-(ffq: Question) 

OPTIONAL MATCH (inwrkflw)-[:`INTERVIEW_WORKFLOW_FEEDBACK`]-(ff:Feedback)

OPTIONAL MATCH (iwr : User)-[:`FEEDBACK_BY`]->(ff)-[:`FEEDBACK_ANSWER`]->(answer:Answer)-[:`QUESTION_ANSWER`]->(ffq) 

with collect({answer : answer.value, rating: answer.rating, question : ffq.qText, givenBy : iwr.fullName, type: ffq.questionType, givenOn: answer.lastModifiedDate}) as rnds, cc, pos, intrnd

with filter(rnd IN rnds WHERE rnd.type = 'COMMENTS') as comments, filter(rnd IN rnds WHERE rnd.type = 'LINEAR_GENERIC') as ratings, cc, pos, intrnd

with distinct collect({roundName: intrnd.name, ratings: ratings, comments: comments}) as rounds, cc, pos

return collect({cc: cc, rounds: rounds}) as data, pos.dId as posId, pos.title as posTitle

dId is unique on every node.

The problem with this query is for small data set let's say 1000 candidates with 10 positions it will run fine. But for large data sets it it taking too long to return results. I even waited for 5 mins in neo4j console for the response but there was no response in 5 minutes.

The application will not have 1000 candidates. The candidates count will go upto 100000 for a very least and maximum I can assume to be 1 million per company.

I have tried various way to optimise this query but couldn't able to get response.

The response SLA should be within 20 seconds.

My questions are

How can I optimise this query to get the results I want ?
What is wrong in current query ?

If I do profile I won't get any response in 3 4 minutes, explain will do. — Ritesh Sangwan

Michael Hunger Michael Hunger · Accepted Answer · 2016-06-04T21:50:55

First I upgraded your db to 2.3.3

Your model consists of two subgraphs that describe

the meta-information of the hiring process
the concrete feedback/answers for one candidate

For your company both are pretty large:

235600 paths from company to Question via the interview process
83937 from company to question via the concrete answers

each of which takes about 1s to compute

If you just query across them, you'll multiply out the numbers which ends up to 20bn paths.

which takes quite forever to compute

My solution was to query one sub-graph first and then put it aside (in an aggregation) and then query the second subgraph

First I brought them together by matching answers to questions (which is turned into an ExpandInto operation)

WHERE (answer)-[:QUESTION_ANSWER]->(ffq)

Which made the query finish in about 15 seconds.

Then I extended the concrete (answer) subgraph one step to the questions and brought them together at the question itself (ffq = ffq2)

This brought the total execution time down to 1.6 seconds.

Here is the final query:

MATCH (comp:Company {dId: "155dyv1wgT"})<-[:`POSITION_COMPANY`]-(pos: Position {status: 'OPEN'})-[:`POSITION_WORKFLOW`]->(:WorkFlow)-[:`WORKFLOW_CANDIDATE-WORKFLOW`]->(cw : CandidateWorkFlow)-[:`CANDIDATE-WORKFLOW_COMPANY-CANDIDATE`]->(cc : CompanyCandidate)

where ((not (has(cc.isSpam) or has(cc.isTrash))) OR (cc.isSpam=false and cc.isTrash=false)) and pos.positionType IN ['PUBLIC','DISCRETE'] with distinct comp, {dId: pos.dId, title: pos.title} as pos, cw, cc

with distinct comp, {dId: pos.dId, title: pos.title} as pos, cw, cc

MATCH (cw)-[:`CANDIDATE_WORKFLOW_INTERVIEW`]->(inwrkflw)

MATCH (inwrkflw)-[:`INTERVIEW_ROUND`]->(intrnd)-[:`INTERVIEW_ROUND_FEEDBACK`]->(ffform)-[:`FEEDBACK_QUESTION`]-(ffq) 

WITH comp,pos, cw, cc,inwrkflw, collect({round:intrnd,form:ffform,question:ffq}) as workflow_questions

MATCH (inwrkflw)-[:`INTERVIEW_WORKFLOW_FEEDBACK`]-(ff:Feedback)
MATCH (iwr : User)-[:`FEEDBACK_BY`]->(ff)-[:`FEEDBACK_ANSWER`]->(answer:Answer)-[:`QUESTION_ANSWER`]->(ffq2) 

UNWIND workflow_questions as wq

WITH comp,pos, cw, cc,inwrkflw, iwr,ff,answer, wq.round as intrnd, wq.form as ffform, wq.question as ffq

WHERE ffq2 = ffq

with collect({answer : answer.value, rating: answer.rating, question : ffq.qText, givenBy : iwr.fullName, type: ffq.questionType, givenOn: answer.lastModifiedDate}) as rnds, cc, pos, intrnd

with filter(rnd IN rnds WHERE rnd.type = 'COMMENTS') as comments, filter(rnd IN rnds WHERE rnd.type = 'LINEAR_GENERIC') as ratings, cc, pos, intrnd

with collect({roundName: intrnd.name, ratings: ratings, comments: comments}) as rounds, cc, pos

return collect({cc: cc, rounds: rounds}) as data, pos.dId as posId, pos.title as posTitle;

neo4j cypher query running very slow

1 Answers