
Overview

I am using the Neo4j Desktop browser to create a graph of page relationships within a website. I'm sure LOAD CSV would make this more efficient, but it doesn't seem like these queries should cause as many problems as they do.

  1. Creation of nodes takes longer than expected (syntax preference?)
  2. Relationship creation spins and times out/crashes

. . .

Problem 1

Creation of nodes takes longer than expected (syntax preference?)

I am creating about 6,500 very basic nodes (one property each):

create (a1:link {description:"www.samplelink.com/example1"})

I am building my query in Excel and copy-pasting it into the Neo4j browser. I can construct it one of two ways:

create (a1:link {description:"www.samplelink.com/example1"})
create (a2:link {description:"www.samplelink.com/example2"})
create (a3:link {description:"www.samplelink.com/example3"})
...x6,000

OR

create (a1:link {description:"www.samplelink.com/example1"}),
(a2:link {description:"www.samplelink.com/example2"}),
(a3:link {description:"www.samplelink.com/example3"}),
...x6,000

Q: Is there a preferred syntax? What's the advantage of each? With only 6,500 nodes (especially basic ones without much data), I wouldn't expect a massive performance difference either way. The query takes anywhere between 5 and 15+ minutes, with the browser's stated runtime either 7,000 ms or 47,000 ms. But the actual browser spinning takes MUCH longer than the stated final runtime.

. . .

Problem 2

Relationship creation spins and times out/crashes

I construct (what I interpret as) very simple MATCH clauses to bind the variables. The string matches are literal (no regex), there's no graph traversal, and the relationships are straightforward.

match (a1:link {description:"www.samplelink.com/example1"})
match (a2:link {description:"www.samplelink.com/example2"})
match (a3:link {description:"www.samplelink.com/example3"})
...x6,000

create (a1)-[:REF]->(a3)
create (a1)-[:REF]->(a47)
create (a5832)-[:REF]->(a9)
...x5,000

This query runs for 2+ hours and then crashes/times out.

Q: Again, syntax-wise, am I doing something incredibly memory-hungry? Should this be written a slightly different way? One MATCH clause with commas? One CREATE clause for all the relationships?

. . .

My reading materials

1. I considered this article on cardinality:

https://neo4j.com/developer/kb/understanding-cypher-cardinality/

It seems like maybe I'm accidentally creating a massive cross product of relationships rather than each single relationship as intended. I also don't know whether the MATCH syntax is doing something funny with the way Neo4j builds "rows", holds them in memory, and then performs the desired operation on each row.

Is it more efficient to do all the matching within one MATCH clause? Same question for the CREATE of the relationships:

MATCH (a1:link {desc:"alpha"}),
(a2:link {desc:"beta"}),
(a3:link {desc:"gamma"})

2. Indexes

On other posts about spinning queries, I saw a lot of people commenting to create an index.

I did try creating an index with CREATE INDEX ON :link(description), but coming from a SQL background, I don't understand how this would materially speed up a query with only 6,500 literal string matches.
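(For what it's worth, my understanding from the Neo4j docs is that without an index each literal MATCH on description is a full scan over every :link node, so thousands of matches mean thousands of label scans; an index turns each one into a direct seek. A sketch of checking this in the browser; the operator names come from the docs, so treat them as my assumption:)

```cypher
// Create the index (Neo4j 3.x syntax; 4.x+ uses
// CREATE INDEX FOR (l:link) ON (l.description))
CREATE INDEX ON :link(description);

// PROFILE shows which operator the planner used:
// NodeByLabelScan without the index, NodeIndexSeek once it is online
PROFILE MATCH (a:link {description: "www.samplelink.com/example1"})
RETURN a;
```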

3. Similar hang problem

Neo4j crashes on batch import

The accepted answer's third point suggests breaking the work into smaller transactions of about 100 per MATCH/CREATE. I guess I could do this? But it seems like a lot of fiddling in Excel to make sure each MATCH clause includes the proper nodes for its CREATE section. It seems like Neo4j should be able to handle 6,500 nodes and 5,000 basic relationships in memory; I'm not doing anything advanced here.

Updates

I am re-running the query now in the "MATCH node, node, node" format rather than the "MATCH node MATCH node MATCH node" format, with only one CREATE statement: a single relationship between 2 nodes. This (apparently massive) MATCH clause with a single CREATE clause is taking 15+ minutes. So I think the problem is holding all the matched nodes in memory.

Query ends with an error: "Neo.TransientError.General.StackOverFlowError - There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag."
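(Following the error's own suggestion, the stack size can be raised in neo4j.conf; the 2M value below is just the error message's example, not a tuned number:)

```
# conf/neo4j.conf
# Raise the JVM thread stack size so very large queries can be planned
dbms.jvm.additional=-Xss2M
```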

I rewrote it as extremely basic statements (MATCH node1 MATCH node2 CREATE (node1)-[:REL]->(node2);) and strung these queries together. Each mini-query runs consecutively, but in my Neo4j browser it takes literally 2 seconds per query (after a 30-second warm-up to process/compile the initial query). 300 queries will take 10 minutes at this rate, and I have 5,000 statements to get through. There has to be a more efficient way when people are creating graphs with thousands/millions/billions of nodes. Is it as simple as "don't use the Neo4j browser" and use LOAD CSV?


2 Answers

1 vote

Problem 1: You should pass a list of all the description values as a parameter to the query, and the query can use UNWIND to iterate over the elements of that list. The query text will be very small and will execute more quickly (and this also avoids Cypher injection attacks).

For example (if the list is passed in a descriptions parameter):

UNWIND $descriptions AS desc
CREATE (a1:link {description: desc})

You may want to break up very large lists into smaller chunks, but 6500 is not very large.
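(If the list ever does get large enough to need batching, one option, assuming the APOC plugin is installed, is apoc.periodic.iterate, which commits in chunks:)

```cypher
// Commit every 1,000 descriptions instead of one huge transaction
// (assumes APOC is installed and $descriptions is passed as a parameter)
CALL apoc.periodic.iterate(
  "UNWIND $descriptions AS desc RETURN desc",
  "CREATE (:link {description: desc})",
  {batchSize: 1000, params: {descriptions: $descriptions}}
);
```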

Problem 2: You can use @TomažBratanič's approach, or an approach similar to mine for Problem 1. That is, you could pass a list of pairs of description values to your query.

For example, if each element of the descriptionPairs parameter is a list of 2 description values:

UNWIND $descriptionPairs AS descPair
MATCH (a1:link {description: descPair[0]})
MATCH (a2:link {description: descPair[1]})
CREATE (a1)-[:REF]->(a2)

And, to make this query really fast, you should also create an index on :link(description).

NOTE: If you want to avoid creating duplicate nodes or relationships, you should use MERGE instead of CREATE for both of my approaches. You should carefully read the documentation for MERGE so that you understand how to use it properly, but the above queries are simple enough that replacing CREATE with MERGE is safe.
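(For example, the relationship query with CREATE swapped for MERGE, using the same descriptionPairs parameter shape as above, would look like:)

```cypher
// MERGE makes the import idempotent: re-running it will not
// duplicate :link nodes or REF relationships
UNWIND $descriptionPairs AS descPair
MERGE (a1:link {description: descPair[0]})
MERGE (a2:link {description: descPair[1]})
MERGE (a1)-[:REF]->(a2)
```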

0 votes

Instead of preprocessing the data into Cypher format like:

match (a1:link {description:"www.samplelink.com/example1"})
match (a2:link {description:"www.samplelink.com/example2"})
match (a3:link {description:"www.samplelink.com/example3"})
...x6,000

create (a1)-[:REF]->(a3)
create (a1)-[:REF]->(a47)
create (a5832)-[:REF]->(a9)
...x5,000

you want to preprocess your data into a CSV file, for example:

link_from,link_to
samplelink1,samplelink2

And then use the LOAD CSV statement to import the data:

LOAD CSV WITH HEADERS FROM "file:///yourfile.csv" AS row
MERGE (from:link {description: row.link_from})
MERGE (to:link {description: row.link_to})
MERGE (from)-[:REF]->(to)

With the proper index set up, the import should take about a second.
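(The "proper index" plus batched commits for larger files might look like the sketch below; USING PERIODIC COMMIT is the Neo4j 3.x mechanism, and newer versions replace it with CALL { ... } IN TRANSACTIONS:)

```cypher
// Index first, so each MERGE is an index seek rather than a label scan
CREATE INDEX ON :link(description);

// Commit every 1,000 rows to keep memory use bounded
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///yourfile.csv" AS row
MERGE (from:link {description: row.link_from})
MERGE (to:link {description: row.link_to})
MERGE (from)-[:REF]->(to);
```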