To keep things simple, as part of the ETL on my time-series data, I added a sequence-number property to each row, running from 0 to 370365 (370,366 nodes, 5,555,490 properties, so not that big). I later added a second one: the original is named "outeseq" and the second "ineseq". The idea was to see whether basing the relationship on an outright equality might speed things up a bit.

I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16 GB max (if it can even use that on a Windows box):

MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;

or

MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;

I also added these in hopes of speeding things up:

CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE

CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE

I can't get the relationships created for the entire data set! Help!

Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.

I profiled the query, but didn't see any reason for it to blow up.

Another question: I would like each relationship to have a property representing the difference between the timestamps of its two nodes (delta-t). Is there a way to take the difference between the values in two sequential nodes and assign it to the relationship, for all of the relationships at the same time?

The last question, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the nearest node with the minimum delta, but I didn't go straight at this for fear that it would scan all of the nodes in order to build each relationship.

Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.

It seems like this should be so easy...it probably is and I'm blind. Thanks!

One other point - I tried using "LOAD CSV" to CREATE the relationships from the proxy adjacency list, modifying the 4 or 5 examples I could find online, but could not get this approach to work. If anyone has a code snippet that more closely approximates what I am after, that might just be the ticket. The CSV has ineseq: toInt(line.ineseq), outeseq: toInt(line.outeseq), timestamp: toFloat(line.timestamp) available. - dwozman

1 Answer

Creating Relationships

Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. That seems feasible because outeseq and ineseq are unique and numeric, so you can sort the nodes by those properties and run the query against one slice at a time.

MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;

Covering all the data takes about 13 runs of the query (370,366 nodes / 30,000 per page, rounded up), with {offset} increased by 30,000 each time. A short script in any language that has a Neo4j client would automate this.
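
The paging loop could be scripted along these lines. This is a minimal Python sketch, not a tested loader: run_query is a hypothetical callback standing in for whatever Neo4j client call you use, and the node count and page size are simply the figures from the question.

```python
import math

TOTAL_NODES = 370366  # node count quoted in the question
PAGE_SIZE = 30000     # slice size the queries are known to handle

# The paged query from above; {offset} stays a query parameter.
PAGED_QUERY = """
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
"""


def page_offsets(total_nodes, page_size):
    """Return the SKIP offsets needed to cover total_nodes rows."""
    pages = math.ceil(total_nodes / page_size)
    return [page * page_size for page in range(pages)]


def run_in_pages(run_query, total_nodes=TOTAL_NODES, page_size=PAGE_SIZE):
    """Call run_query(query, params) once per page and return the run count.

    run_query is a hypothetical callback: wrap your client's execute call in it.
    """
    offsets = page_offsets(total_nodes, page_size)
    for offset in offsets:
        run_query(PAGED_QUERY, {"offset": offset})
    return len(offsets)


if __name__ == "__main__":
    # Dry run: print each parameter set instead of touching a database.
    run_in_pages(lambda query, params: print(params))
```

Keeping the client call behind a callback means the offsets can be checked in a dry run before anything is written to the database.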

Updating Relationship's Properties

You can assign the timestamp delta to the relationships with a SET clause after the MATCH. Assuming that timestamp is numeric:

MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
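
To make the arithmetic concrete, here is a small Python sketch of the same calculation outside the database. The sample timestamps are made up for illustration; each delta corresponds to one FORWARD_SEQ relationship between neighbouring rows.

```python
def forward_deltas(timestamps):
    """Return abs(b - a) for each consecutive pair (a, b), in sequence order."""
    return [abs(b - a) for a, b in zip(timestamps, timestamps[1:])]


if __name__ == "__main__":
    sample = [10.0, 12.5, 12.5, 20.0]  # made-up timestamps, ordered by outeseq
    print(forward_deltas(sample))
```

A list of n timestamps gives n - 1 deltas, matching the one relationship per adjacent pair of nodes.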

Chaining Nodes With Minimal Delta

Once the relationships carry the delta property, the graph becomes a weighted graph, so we can calculate the shortest path using the deltas as weights. Then we just save the length of the shortest path (the sum of the deltas) on a relationship between the first and the last node.

MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
    reduce(weight=0, r in relationships(p) | weight+r.delta) AS totalDelta
    ORDER BY totalDelta ASC
    LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
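
On the scanning worry from the question: if the chain is built from the raw timestamps, one alternative to a variable-length path match is to sort once and link neighbours, since in sorted order the nearest later timestamp of any node is simply the next element. This is a different technique from the query above, sketched in Python under the assumption that each node can be exported as a (node_id, timestamp) pair; the function and field names are hypothetical.

```python
def chain_by_timestamp(nodes):
    """Link each node to the next one in timestamp order.

    nodes: iterable of (node_id, timestamp) pairs.
    Returns (from_id, to_id, delta) triples, one per link.
    """
    ordered = sorted(nodes, key=lambda node: node[1])
    return [
        (a_id, b_id, b_ts - a_ts)
        for (a_id, a_ts), (b_id, b_ts) in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    # Made-up sample data, deliberately out of order.
    for link in chain_by_timestamp([("c", 9.0), ("a", 1.0), ("b", 4.0)]):
        print(link)
```

The sort is done once for the whole data set, so no node has to be compared against every other node; the resulting triples could then be loaded as relationships in batches.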

Disclaimer: the queries above are not guaranteed to work as written; they just hint at possible approaches to the problem.