Cypher Neo4j - Query that uses the clause 'IN' on the collection is very slow

Question

Hi, I'm trying to import some data from CSV files into Neo4j 2.3.1. I've already imported some nodes of type :Author and :Article.

The Author node is composed of properties like:

key -> String
principal_name -> String
alias -> Collection of String
........

I've also added indexes on principal_name, alias and key.

The problem comes when I try to import the relationships between nodes of type Article and Author.

The CSV has this type of structure:

articleKey,authorName

As a naive solution, I tried to create the relationships using a query like this one:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH (art:Article{key: line.key1})
MATCH (auth:Author) WHERE line.key2 IN (auth.alias)
CREATE UNIQUE (auth)-[:AUTHOR_OF]->(art);

The query is painfully slow, and the profiler shows that the second MATCH is the bottleneck. It takes 10-12 seconds to create each relationship because I have many Authors in the database (around 1,000,000).

So I'm looking for a way to run something like the following to get faster execution (this is only pseudocode to illustrate the structure I want):

MATCH (auth:Author{principal_name: line.key2})
IF auth null THEN
  MATCH (auth:Author) WHERE line.key2 IN (auth.alias)
END

Is there a way to do that in Cypher?

cybersam · Accepted Answer · 2015-12-03T18:23:50

If you changed your model so that each of an Author node's names (the principal name and every alias) is stored in a separate Name node, like this:

(auth:Author)-[:HAS_NAME]->(name:Name {name: 'Fred McGillicutty'})

Then the query would be simply:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH
  (art:Article { key: line.key1 }),
  (auth:Author)-[:HAS_NAME]->(name:Name { name:line.key2 })
CREATE (auth)-[:AUTHOR_OF]->(art);

If you create indexes on :Article(key) and :Name(name), this query should be very efficient.
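
For completeness, here is a rough sketch of how the indexes and the Name nodes could be set up from the existing data. It assumes the property names from the question (principal_name, alias) and that a name shared by several authors should point to a single shared Name node. Neither assumption is confirmed by the asker, so adjust as needed. Run the index statements first, separately from the data statement.

```cypher
// Indexes mentioned above (Neo4j 2.x syntax); run these on their own first.
CREATE INDEX ON :Article(key);
CREATE INDEX ON :Name(name);

// Assumed migration: give every Author one Name node per name it is known by.
// coalesce() covers authors that have no alias property at all.
MATCH (auth:Author)
UNWIND [auth.principal_name] + coalesce(auth.alias, []) AS n
WITH auth, n
WHERE n IS NOT NULL                  // skip authors without a principal_name
MERGE (name:Name {name: n})          // reuse the Name node if it already exists
MERGE (auth)-[:HAS_NAME]->(name);
```

With around 1,000,000 Author nodes, doing the migration in a single transaction may need a lot of memory, so it may have to be run in batches.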
