Overview
I am using the Neo4j Desktop browser to create a graph of page relationships within a website. I'm sure LOAD CSV would make this more efficient, but it doesn't seem like these queries should cause as many problems as they do.
- Creation of nodes takes longer than expected (syntax preference?)
- Relationship creation spins and times out/crashes
. . .
Problem 1
Creation of nodes takes longer than expected (syntax preference?)
I am creating about 6,500 very basic nodes (1 piece of information within each):
create (a1:link {description:"www.samplelink.com/example1"})
I am building my query in Excel and copy-pasting it into the neo4j browser. I can construct it one of two ways:
create (a1:link {description:"www.samplelink.com/example1"})
create (a2:link {description:"www.samplelink.com/example2"})
create (a3:link {description:"www.samplelink.com/example3"})
...x6,000
OR
create (a1:link {description:"www.samplelink.com/example1"}),
(a2:link {description:"www.samplelink.com/example2"}),
(a3:link {description:"www.samplelink.com/example3"}),
...x6,000
Q: Is there a preferred syntax? What's the advantage of each? With only 6,500 nodes (especially basic ones without much information), it doesn't seem like there should be a massive performance difference. The query takes anywhere from 5 minutes to 15+ minutes, while the runtime reported by the program is either 7,000 ms or 47,000 ms. The actual browser spinning takes MUCH longer than the reported final runtime.
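For comparison, a third form avoids thousands of separate variables altogether: pass the URLs as a list and create one node per element with UNWIND. The sketch below uses only the three sample URLs from above. Whether it actually runs faster in the browser for 6,500 rows has not been measured here.

```cypher
// One CREATE executed once per list element, instead of 6,500 CREATE
// clauses that each introduce their own variable (a1, a2, ...).
UNWIND [
  "www.samplelink.com/example1",
  "www.samplelink.com/example2",
  "www.samplelink.com/example3"
] AS url
CREATE (:link {description: url})
```

The list could still be built in Excel, since each row contributes one quoted string rather than a whole clause.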
. . .
Problem 2
Relationship creation spins and times out/crashes
I construct (what I interpret to be) very simple MATCH clauses to bind the nicknames (variables) to the nodes. The string matches are literal (with no regexp), there's no graph traversal, and the relationships are straightforward.
match (a1:link {description:"www.samplelink.com/example1"})
match (a2:link {description:"www.samplelink.com/example2"})
match (a3:link {description:"www.samplelink.com/example3"})
...x6,000
create (a1)-[:REF]->(a3)
create (a1)-[:REF]->(a47)
create (a5832)-[:REF]->(a9)
...x5,000
This query runs for 2+ hours and then crashes/times out.
Q: Again syntax-wise, am I doing something incredibly memory hungry? Should this be written a slightly different way? One MATCH phrase with commas? One CREATE phrase for the relationships?
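One way the same relationships might be expressed without thousands of variables in a single statement is to UNWIND a list of source/target pairs and match only two nodes per row. This is a sketch under assumptions: the map keys `source` and `target` are names chosen for illustration, the first pair corresponds to (a1)-[:REF]->(a3) above, and the second pair is made up for the example.

```cypher
// Each row carries one (source, target) pair; only two nodes are
// matched per row, so the statement stays small however many pairs
// the list holds.
UNWIND [
  {source: "www.samplelink.com/example1", target: "www.samplelink.com/example3"},
  {source: "www.samplelink.com/example2", target: "www.samplelink.com/example3"}
] AS pair
MATCH (src:link {description: pair.source})
MATCH (dst:link {description: pair.target})
CREATE (src)-[:REF]->(dst)
```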
. . .
My reading materials
1. I considered this article on cardinality:
https://neo4j.com/developer/kb/understanding-cypher-cardinality/
It seems like maybe I'm accidentally creating a massive cross-product of relationships rather than each single relationship as intended...? I also don't know whether the MATCH syntax is doing something funny with the way neo4j outputs "rows", holds those in memory, and then does the desired operation on each row.
Is it more efficient to do the MATCH within one MATCH phrase? Same with the CREATE for the relationships.
MATCH (a1:link {description:"alpha"}),
(a2:link {description:"beta"}),
(a3:link {description:"gamma"})
2. Indexes
In many other posts about spinning queries, I saw people recommend creating an index.
I did try to create one with CREATE INDEX ON :link(description), but coming from a SQL background, I don't understand how this would materially speed up a query with only 6,500 literal string matches.
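For reference, CREATE INDEX ON :link(description) is the older index syntax; Neo4j 4.x and later use the CREATE INDEX ... FOR ... ON form. The sketch below shows that form and a way to check whether a lookup uses the index. The index name `link_description` is an arbitrary choice for illustration, and which syntax applies depends on the installed version, which is not stated here.

```cypher
// Newer index syntax (Neo4j 4.x and later). The name is illustrative.
CREATE INDEX link_description FOR (n:link) ON (n.description)
```

```cypher
// Run separately: PROFILE shows the executed plan. An index-backed lookup
// should appear as NodeIndexSeek; without an index the plan typically
// shows NodeByLabelScan followed by a Filter.
PROFILE
MATCH (a:link {description: "www.samplelink.com/example1"})
RETURN a
```

Without an index, each literal match scans all nodes with the label, so 5,000 relationships with two lookups each could mean roughly 10,000 scans over 6,500 nodes.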
3. Similar hang problem
The approved answer's third point suggests breaking the work into smaller transactions of 100 per MATCH/CREATE. I guess I could do this? It seems like a lot of fiddling in Excel to make sure each MATCH clause includes the proper nodes for its CREATE section. It seems like Neo4j should be able to handle 6,500 nodes and 5,000 basic relationships in memory... I'm not doing anything advanced here.
Updates
I am re-running the query now, in the "MATCH node, node, node" format not "MATCH node MATCH node MATCH node" format. I only have 1 CREATE statement, a random relationship between 2 nodes. This (apparently massive) MATCH clause with a single CREATE clause is taking 15+ minutes. So I think it's a matter of holding all the nodes in memory that's the problem.
Query ends with an error: "Neo.TransientError.General.StackOverFlowError - There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag."
I then reconstructed it in an extremely basic form, MATCH node1 MATCH node2 CREATE (node1)-[:REL]->(node2);, and strung these queries together. Each mini-query runs consecutively, but in my Neo4j browser it takes literally 2 seconds per query (after a 30-second warm-up to process/compile the initial query). 300 queries will take 10 minutes at this rate, and I have 5,000 statements to get through. There has to be a more efficient way, given that people are creating graphs with thousands/millions/billions of nodes. Is it as simple as "Don't use the Neo4j browser" and use LOAD CSV instead?
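For context, the LOAD CSV route would look roughly like the sketch below. It is written under assumptions: `links.csv` and `refs.csv` are hypothetical files exported from the Excel sheet and placed in the database's import directory, and the column headers `description`, `source`, and `target` are names chosen for illustration.

```cypher
// Statement 1: one node per row of the hypothetical links.csv
// (single column with header "description").
LOAD CSV WITH HEADERS FROM 'file:///links.csv' AS row
CREATE (:link {description: row.description})
```

```cypher
// Statement 2, run afterwards: one relationship per row of the
// hypothetical refs.csv (columns "source" and "target", holding URLs).
LOAD CSV WITH HEADERS FROM 'file:///refs.csv' AS row
MATCH (src:link {description: row.source})
MATCH (dst:link {description: row.target})
CREATE (src)-[:REF]->(dst)
```

The second statement does two lookups on description per row, so it is the one most likely to benefit from an index on that property.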