I'm working on a Cypher query that applies a "combined limit" to two sets of results: one is immediate neighbors, the other is neighbors reached across "event nodes", as follows:

OPTIONAL MATCH (subject:Person {age:"38"})--(event:Event)--(targetViaEvent)   
OPTIONAL MATCH (subject)--(directTarget)  
  WHERE NOT directTarget:Event  
WITH subject, targetViaEvent, directTarget,  
  COUNT(event) AS eventCount 
  ORDER BY eventCount DESC  
WITH subject, COLLECT(directTarget) + COLLECT(targetViaEvent) as targetList  
UNWIND targetList AS target  
WITH DISTINCT subject, target 
SKIP 0 LIMIT 10
...

The main purpose of this Cypher query is:

  1. Find all the neighbors
  2. If a neighbor is labeled Event, find the other neighbors of the event
  3. Sort the event-connected neighbors by the number of events
  4. Return the neighbors found above, whether labeled Event or not, using SKIP and LIMIT for pagination
     4.1. If possible, return neighbors with the Event label ahead of the ones without

Other specifications:

  1. All relationship types and directions are taken into account, so none of them are filtered

With COLLECT() in use, execution becomes unbelievably slow and stalls the neo4j shell, as each subject may have tens of thousands of directTarget and targetViaEvent nodes. I suspect COLLECT() holds every matched node object in memory and so jams Neo4j at this data scale. My intention is just to combine the two result sets and apply the limit to them together. Are there any tricks to improve my Cypher?


EDIT:

@InverseFalcon pointed out a mistake in my Cypher above, so here's my entire query with updates:

PROFILE MATCH (subject:Person {age:"38"})
OPTIONAL MATCH (subject)--(directTarget)
  WHERE NOT directTarget:Event
OPTIONAL MATCH (subject)--(event:Event)--(targetViaEvent)
WITH subject, targetViaEvent, directTarget,
     COUNT(event) AS eventCount ORDER BY eventCount DESC
WITH subject, COLLECT(directTarget) + COLLECT(targetViaEvent) as targetList
UNWIND targetList AS target
WITH DISTINCT subject, target SKIP 0 LIMIT 300 WHERE target IS NOT NULL
OPTIONAL MATCH (subject)-[subject_target]-(target)
OPTIONAL MATCH (subject)--(eventPrime)--(target)
WITH subject, subject_target, target, COLLECT(eventPrime)[0..200] AS eventList
UNWIND (CASE eventList WHEN [] THEN [null] else eventList end) as limitedEvents
OPTIONAL MATCH (subject)-[subject_event]-(limitedEvents)-[event_target]-(target)
RETURN subject, subject_target, target, subject_event, limitedEvents, event_target

Note: after the SKIP...LIMIT... I repeat the matches only to identify the relationships between the nodes, because a) I'd like to have the relationships in the JSON result; and b) after quite a few attempts I couldn't manage to fetch the relationships along with the first 3 MATCHes. Specifically, COUNT(event) doesn't work there, because each event is bound to a relationship, so the count is always 1.

You have some ordering by eventCount occurring in the middle, which complicates handling both targetViaEvent and directTarget in the same match. Also, you haven't given any details on the relationships in your matches. That information could be useful for devising a match that covers both types of nodes at once, as would knowing whether the relationship types used are exclusive to these kinds of patterns. - InverseFalcon
Thanks for the clarification. Given those requirements, I don't think we can construct a MATCH that will fulfill your conditions and match on both directTarget and targetViaEvent nodes at the same time in the same column. Until Neo4j or APOC procedures introduce some means of pulling results from separate matches into the same column (a UNION WITH would be perfect), adding the collections together is probably the best available approach. - InverseFalcon
@InverseFalcon yes, it would be lovely to have UNION WITH! Does this imply the solution in your answer won't fulfill my requirements? - Todd Leo
My answer should work with your requirements. By building up the aggregations right after each separate match, instead of trying to do both matches and aggregations at the same time, you should see a fairly substantial performance improvement, but I'm unsure how well combining huge collections will scale with the number of targets in your graph. Hoping for good results; let us know how it works for you, or if this remains a bottleneck. - InverseFalcon

1 Answer

We can improve your query a bit. As it stands now, you're building up rows with each event + targetViaEvent in a cartesian product with every directTarget, so you're doing a ton of work you don't need to do. A good approach, especially with back-to-back MATCHes or OPTIONAL MATCHes where you want aggregations from both, is to build up your aggregations on each of them individually rather than trying to do them all at once. This avoids the cartesian product.
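To get a feel for the size of that cartesian product, a rough diagnostic like the following may help. It is only a sketch: the aliases eventPaths and directPaths are made-up names, and it counts matched paths rather than distinct nodes.

```cypher
// Count each side separately, aggregating right after its own match,
// so the two patterns are never multiplied together.
MATCH (subject:Person {age:"38"})
OPTIONAL MATCH (subject)--(event:Event)--(targetViaEvent)
WITH subject, COUNT(targetViaEvent) AS eventPaths
OPTIONAL MATCH (subject)--(directTarget)
  WHERE NOT directTarget:Event
RETURN subject, eventPaths, COUNT(directTarget) AS directPaths
```

When both counts are non-zero, back-to-back OPTIONAL MATCHes without an aggregation in between would build roughly eventPaths * directPaths rows for that subject, which is where the extra work comes from.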

I'd suggest this as a replacement query:

MATCH (subject:Person {age:"38"})
OPTIONAL MATCH (subject)--(event:Event)--(targetViaEvent)
WITH subject, COUNT(event) AS eventCount, targetViaEvent
ORDER BY eventCount DESC
WITH subject, COLLECT(targetViaEvent) as eventTargets
// Above WITH means we now have only one row per subject so far
OPTIONAL MATCH (subject)--(directTarget)
  WHERE NOT directTarget:Event
WITH subject, COLLECT(directTarget) + eventTargets as targetList
UNWIND targetList AS target
WITH DISTINCT subject, target SKIP 0 LIMIT 10
...

EDIT

I just noticed a problem in your original query. Your two OPTIONAL MATCHes share the 'subject' variable, which makes the second OPTIONAL MATCH dependent upon the subjects bound by the first. It will not look for that pattern on :Persons who did not match your first OPTIONAL MATCH.

Basically, that set of OPTIONAL MATCHes should execute just as if the first OPTIONAL MATCH were a MATCH instead.

If your intent was to run both OPTIONAL MATCHes on all :Persons, then you may have to change the first part of your query to this:

MATCH (subject:Person {age:"38"})
OPTIONAL MATCH (subject)--(event:Event)--(targetViaEvent)   
OPTIONAL MATCH (subject)--(directTarget) 
... 

This may impact both your original query's speed and number of results built up.

Also, both our queries (after you change yours) will return rows for subjects without targets wherever neither OPTIONAL MATCH matched anything for a subject (in those cases, a single subject with a null target). If these aren't desired in the return, we'll both need to add WHERE target IS NOT NULL after the final WITH.
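For readers on later releases: the "UNION WITH" wished for in the comments is roughly what CALL { ... } subqueries offer from Neo4j 4.0 onward, which postdates this thread. The following is a minimal, untested sketch that assumes a 4.x server; the alias rank is a made-up name, and direct targets are given an event count of 0 so they sort after the event-connected ones.

```cypher
MATCH (subject:Person {age:"38"})
CALL {
  // Branch 1: targets reached through an Event, with their event count
  WITH subject
  MATCH (subject)--(event:Event)--(target)
  RETURN target, COUNT(event) AS eventCount
  UNION
  // Branch 2: direct, non-Event neighbors, ranked below any event target
  WITH subject
  MATCH (subject)--(target)
  WHERE NOT target:Event
  RETURN target, 0 AS eventCount
}
// A target found by both branches appears twice; keep its highest count
WITH subject, target, max(eventCount) AS rank
ORDER BY rank DESC
RETURN subject, target
SKIP 0 LIMIT 10
```

This keeps both kinds of targets in a single column without adding two collections together, and a subject with no targets at all simply produces no rows. Whether it is actually faster on a graph of this size would need to be checked with PROFILE.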