I am trying to write a query that finds all nodes with outdegree X and returns only the paths that contain those nodes and have a length equal to Y.

To get only the nodes with outdegree X, I use the following Cypher query:

MATCH (s:URL)-[r:VISITED*]->(t:URL) 
WITH s, count(t) as degreeout 
WHERE 73 in s.job_id and degreeout <4 
return s, degreeout

To get only the paths with length = Y, I use the following query:

MATCH p=(s:URL)-[r:VISITED*]->(t:URL)
WHERE length(p)=7
return p 

I tried to combine the previous two queries in the following query:

MATCH (s:URL)-[r:VISITED*]->(t:URL)
WITH s, COLLECT(DISTINCT id(s)) as matched, count(t) as degreeout
WHERE 73 in s.job_id and degreeout <4
MATCH p=(s2:URL)-[r:VISITED*]-(t2:URL)
WHERE id(s2) in matched and length(p) >=1
RETURN p

Whenever I execute the query, the machine keeps processing and then I get an out-of-memory error.

It seems like there is an infinite loop!

1 Answer

If you're just interested in traversing the relationship an exact number of times you can include that in the path expression:

MATCH p=(s:URL)-[r:VISITED*7]->(t:URL)
return p 

In general you should avoid traversals of unlimited length, i.e. :VISITED*. If you want to keep the depth variable because it's unknown, it's good practice to set a maximum, e.g. :VISITED*..7.

If I understood correctly, your original query can be adjusted just by setting the variable length to 7 in the path:
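
As a rough sketch of what a bounded traversal looks like (the bounds 1..7 and the LIMIT value are only illustrative choices, not something taken from your data):

```cypher
// Follow between 1 and 7 VISITED hops instead of an unbounded number.
// *..7 is shorthand for *1..7.
MATCH p = (s:URL)-[:VISITED*1..7]->(t:URL)
RETURN p
LIMIT 25  // illustrative cap while experimenting
```

The upper bound keeps the number of expanded paths finite even on a densely connected graph.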

MATCH (s:URL)-[r:VISITED*7]->(t:URL) 
WITH s, count(t) as degreeout 
WHERE 73 in s.job_id and degreeout <4 
return s, degreeout

You should see some performance improvement because the traversal now stops after 7 hops, so longer paths are never explored or returned. Again, always avoid unlimited-depth traversals unless there is a really good reason for them and you have enough computing resources and time for the query to complete.
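
If the goal is to return the paths themselves rather than the counts, something along these lines might work. It is only a sketch under two assumptions of mine: that "outdegree" means the number of direct outgoing VISITED relationships, and that the node of interest is the start of the path.

```cypher
// Step 1: count only the direct outgoing VISITED relationships of s
// (assumption: this is what "outdegree" means here).
MATCH (s:URL)-[r:VISITED]->()
WHERE 73 IN s.job_id
WITH s, count(r) AS degreeout
WHERE degreeout < 4
// Step 2: expand fixed-length paths starting from the nodes that passed the filter.
MATCH p = (s)-[:VISITED*7]->(:URL)
RETURN p
```

Reusing s in the second MATCH avoids collecting ids and scanning all URL nodes again, and the fixed length of 7 keeps the expansion bounded.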

Regarding performance best practices, a query that filters on 73 in s.job_id will still not perform very well, since it forces a scan to find the start nodes. I assume, from the use of the in operator, that the nodes labeled URL have an array property job_id. In that case Neo4j needs to read all URL nodes and their properties, then scan through those arrays just to find the start nodes.

I would recommend changing your data model to use schema indexes based on an exact property value. Example:

(j:Job {job_id: 73})-[:SOMETHING?]->(u:URL {...})
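
A one-off migration to that model could look roughly like the following. The relationship type HAS_URL is a placeholder I made up for the SOMETHING? above, and the sketch assumes job_id is an array property on every URL node that should be linked:

```cypher
// Turn each entry of the job_id array into its own Job node
// and link it to the URL node it came from.
MATCH (u:URL)
UNWIND u.job_id AS jid
MERGE (j:Job {job_id: jid})
MERGE (j)-[:HAS_URL]->(u)  // HAS_URL is a hypothetical relationship name
```

MERGE is used so that a Job node is created only once per distinct job_id, even when many URL nodes share it.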

We would also add a schema index:

CREATE INDEX ON :Job(job_id)
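
Note that the statement above is the older index syntax; it was removed in Neo4j 5. On Neo4j 4.x and later, the equivalent would be along these lines (the index name job_id_index is just an example name):

```cypher
// Newer index syntax (Neo4j 4.x and later); job_id_index is an arbitrary name.
CREATE INDEX job_id_index FOR (j:Job) ON (j.job_id)
```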

Then you can query like this:

MATCH (j:Job {job_id: 73})-[:SOMETHING?]->(s:URL)-[r:VISITED*7]->(t:URL) 
WITH s, count(t) as degreeout 
WHERE degreeout <4 
return s, degreeout