I am evaluating Neo4j (Community Edition).
I am importing about 1 million rows with LOAD CSV. Each row has to match two previously imported nodes and create a relationship between them.

Here is my query:

//Query #3
//create edges between Tr and Ad nodes

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
 FIELDTERMINATOR '\t'

//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)

//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)

I have the following indexes:

Indexes
  ON :Ad(p58)       ONLINE (for uniqueness constraint) 
  ON :Tr(txid)      ONLINE                             
  ON :Tr(h)         ONLINE (for uniqueness constraint)
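
For reference, a sketch of schema statements that would produce a listing like the one above. The post does not show how the indexes were actually created, so these statements are an assumption; they use the Cypher 3.0 syntax that matches the compiler version reported in the plan.

```cypher
// Assumed declarations, reconstructed from the index listing (not from the original post)

// Uniqueness constraint on :Ad(p58); this also creates the backing index
CREATE CONSTRAINT ON (ad:Ad) ASSERT ad.p58 IS UNIQUE;

// Plain (non-unique) index on :Tr(txid)
CREATE INDEX ON :Tr(txid);

// Uniqueness constraint on :Tr(h)
CREATE CONSTRAINT ON (tx:Tr) ASSERT tx.h IS UNIQUE;
```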

This query has been running for 5 days and has so far created 270K of the 1M relationships.
The Java heap is 4 GB.
The machine has 32 GB of RAM and an SSD, and runs only Linux and Neo4j.

Any hints to speed this process up would be highly appreciated.
Should I try the Enterprise Edition?

Query Plan:

+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms

Compiler CYPHER 3.0

Planner COST

Runtime INTERPRETED

+---------------------------------+----------------+---------------------+----------------------------+
| Operator                        | Estimated Rows | Variables           | Other                      |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults                 |              1 |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +EmptyResult                    |                |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +Apply                          |              1 | line -- ad, out, tx |                            |
| |\                              +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4)   |              1 | ad, out, tx         |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +CreateRelationship           |              1 | out -- ad, tx       |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +ValueHashJoin                |              1 | ad -- tx            | ad.p58; line.p58           |
| | |\                            +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek              |              1 | tx                  | :Tr(txid)                  |
| | |                             +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) |              1 | ad                  | :Ad(p58)                   |
| |                               +----------------+---------------------+----------------------------+
| +LoadCSV                        |              1 | line                |                            |
+---------------------------------+----------------+---------------------+----------------------------+
Can you add the query plan (the results of adding EXPLAIN to the beginning of your query)? – William Lyon

1 Answer

OK, splitting the MATCH statement into two sped up the query immensely. Thanks to @William Lyon for pointing me to the plan, where I noticed the cartesian product warning.

Old MATCH statement:

MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

split into two:

MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
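
With the two MATCH clauses in place, the full import query would look roughly like the sketch below. It is simply the query from the question with the MATCH line swapped out; everything else (file name, field terminator, property names) is copied unchanged. The likely reason it helps is that each lookup can be planned as its own index seek instead of being combined through the ValueHashJoin shown in the original plan, but that is an inference from the plan, not something confirmed here.

```cypher
// Create edges between Tr and Ad nodes (sketch: original query with the MATCH split in two)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'

// Look up each endpoint in its own MATCH clause
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad { p58: line.p58 })

// Create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)

// Set properties on the edge
SET out.id = TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)
```

Prefixing the statement with EXPLAIN, as suggested in the comment, shows the plan without running the import, which makes it easy to check that the cartesian product warning is gone before starting a long load.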

With 750K relationships, the query took 83 seconds.
Next up: a LOAD CSV of 9 million rows.