0
votes

[UPDATE] I had a few good recommendations on the Neo4j community forum here: https://community.neo4j.com/t/nodejs-neo4j-driver-trying-to-read-and-return-result-set-of-over-4m-records/22705/13

Suggestion 1) Only search on Campaigns with an Industry Sector ID - so add a label to Campaign nodes with an industry sector ID 'HasIndustrySector'.

Suggestion 2) Split the query into two parts: 1st part, grab all campaigns with industry sector. 2nd Part, Aggregate the relationships to be optionally searched together as such:

MATCH (c:Campaign)
WHERE c.industrySectorId IS NOT NULL
RETURN id(c) as campaignId

MATCH (c)-[r:SENT_EMAIL|OPENED|CLICKED]-(p)
WHERE id(c) = $campaignId
WITH c, type(r) as type, count(distinct p) as count LIMIT 1000
WITH c, {type: type, count: count} as data
WITH c, collect(data) as totals
RETURN c.campaignId AS campaignId, c.industrySectorId AS industrySectorId, c.senddate AS sendDate, c.subjectLine AS subject, c.preHeader AS preHeader, totals

This has actually resulted in the query completing. The result of profile on this query is:

134178426 total db hits in 5 ms (it says 5ms, but it definitely took around 50+ seconds to actually complete and return the result of the profile).

enter image description here

[ORIGNAL QUESTION]

We have a new server running dedicated enterprise Neo4j.

Recently imported data from MSSQL database producing around 100m nodes and equal number of relationships.

I have connected the neo4j-driver for nodejs fine and run many queries successfully.

However, I am at a point where I need to read/stream around 4 million records from a particular node label.

I have used both session and reactive session to try to do this, but the results just never appear, even if I leave it for hours. However, if I put a LIMIT clause on the query, I get results fine for anything up to 500,000. However, going above this there is just no response at all and also I can't see anything in the logs.

Here is the query:

cypher runtime=slotted 
MATCH (c:Campaign)
OPTIONAL MATCH (c)<-[:HAS]-(i:IndustrySector)
WITH c,i
OPTIONAL MATCH (c {campaignId: c.campaignId})-[sent:SENT_EMAIL]->(sp:Person)
OPTIONAL MATCH (c {campaignId: c.campaignId})<-[opened:OPENED]-(op:Person)
OPTIONAL MATCH (c {campaignId: c.campaignId})<-[clicked:CLICKED]-(cp:Person)
WITH c, i, COUNT(DISTINCT(sp)) AS totalSent, COUNT(DISTINCT(op)) AS totalOpened, COUNT(DISTINCT(cp)) AS totalClicked
RETURN c.campaignId AS campaignId, i.industrySectorId AS industrySectorId, c.senddate AS sendDate, c.subjectLine AS subject, c.preHeader AS preHeader, totalSent, totalOpened, totalClicked

As you can see I have set the cypher runtime to use slotted. I have tried this without but it seems to send the entire results back in one batch, rather than streaming row by row.

Here are two versions I have tried, first is normal session. The second is using reactive session:

// Normal Session
session.run(query)
.subscribe({
onKeys: keys => {
},
onNext: record => {
},
onCompleted: () => {
},
onError: error => {
}});

// Reactive Session
rxSession.run(query)
.records()
.pipe(map(record => record.toObject()))
.subscribe({
...same as above
})

One other thing I tried was to use SKIP and LIMIT on the query so processing in batches of 10k records. This works fine until it reaches 520,000 records then just hangs again.

Would be very grateful if anyone is able to point me to where I am going wrong or a better way to achieve what I want. I have looked at the apoc library such as apoc.periodic.iterate however that is only applicable to performing write operations.

Thanks for any help.

1
A few questions to help add context: What version of Neo4j? What's the heap size set to be? What does a PROFILE on that query look like? What indexes are in place? How many of each node label and relationship type are there?Pablissimo
Neo4j v4.1.1 . Heap size is set to default because in the configuration it says it automatically adjusts based on system. Profile on the query doesn't complete unless I add a limit clause. Campaign has index on campaignId, no other indexes are used. Approx. 100m Person nodes, 100m relationships being traversed, and 4m campaigns.mrpetem

1 Answers

1
votes

Some observations:

  1. You should not use DISTINCT in COUNT(DISTINCT x) when unnecessary, as that forces the query to accumulate in memory a set of all the distinct values found thus far and to check each new value against that list -- so that it can give you an accurate count. In your query, the sp, op, and cp values should have been naturally distinct anyway (assuming you never have multiple relationships of the same type between the same pair of nodes). So, if you had just used COUNT(x) instead, the query could have simply incremented a counter to get each count. However, the query below does not need to use COUNT at all.
  2. When possible, you should avoid traversing relationships or getting nodes if you don't have to. In your case, you could have gotten the counts very quickly using a Cypher pattern like SIZE((x)-[:FOO]->()) that leverages the internal GetDegree() operation.
  3. The industrySectorId value is completely independent of the counts. So, you should not be expending an effort to get all the counts for each industrySectorId, since the counts would never change between industrySectorIds. The query below just collects all the industrySectorId values in a list, and does each count just once.
  4. (c {campaignId: c.campaignId}) is unnecessarily torturous. It should just have been (c).

This query should be much faster:

MATCH (c:Campaign)
RETURN
  c.campaignId AS campaignId,
  c.senddate AS sendDate, c.subjectLine AS subject, c.preHeader AS preHeader,
  [(c)<-[:HAS]-(i:IndustrySector) | i.industrySectorId] AS industrySectorIds,
  SIZE((c)-[:SENT_EMAIL]->()) AS totalSent,
  SIZE((c)<-[:OPENED]-()) AS totalOpened,
  SIZE((c)<-[:CLICKED]-()) AS totalClicked

It uses quick GetDegree() checks to get the counts without actually traversing any relationships or getting any nodes at the other end. It also just puts all the industrySectorId values in a single industrySectorIds. This way, each result row contains all the desired data for each Campaign.

[UPDATE]

Since you have indicated that there can indeed be multiple relationships of the same type between a Campaign and the same Person, #1 and #2 above do not apply to you.

The following query should still be faster than your query, since it uses #3 (to avoid repeating the same set of expensive queries per IndustrySector) and #4, and does the MATCH/COUNT steps one-at-a-time to avoid cartesian products:

MATCH (c:Campaign)
OPTIONAL MATCH (c)-[:SENT_EMAIL]->(sp:Person)
WITH c, COUNT(DISTINCT sp) AS totalSent
OPTIONAL MATCH (c)<-[:OPENED]-(op:Person)
WITH c, totalSent, COUNT(DISTINCT op) AS totalOpened
OPTIONAL MATCH (c)<-[clicked:CLICKED]-(cp:Person)
WITH c, totalSent, totalOpened, COUNT(DISTINCT cp) AS totalClicked
RETURN
  c.campaignId AS campaignId,
  c.senddate AS sendDate, c.subjectLine AS subject, c.preHeader AS preHeader,
  [(c)<-[:HAS]-(i:IndustrySector) | i.industrySectorId] AS industrySectorIds,
  totalSent,
  totalOpened,
  totalClicked