How to optimize a Neo4J Cypher query?

Question

I have an app that converts text into a network: when a sentence is added, every word becomes a node and every co-occurrence of words becomes a connection between them. This information is important for understanding the question below.

To add each sentence to the Neo4J database, I use the Cypher query below. Following my data structure, it first matches the user who is adding the nodes, then matches the context (or list) where the statement is made and links it to the user. It then links the statement to the user and to the context, and finally creates connections (with properties) between every added node, the statement, and the context (list) in which they were made.

The problem is that this query is about 100 times longer than the sentence itself, so if a text is 400 bytes, the query is about 40 KB. When I want to add a long text, Neo4J becomes very slow.

Therefore, my question: how should I optimize this query? Would you recommend splitting it into a set of transactions instead?

Can I, for example, cut each long query into many parts and then send a few transactions concurrently to save time?

I'm talking about a text that is about 100 KB long, maybe longer, which means the total request would be about 10 MB.

MATCH (u:User {uid: "6e228580-1cb3-11e8-8271-891867c15336"}) 
MERGE (c_list:Context {name:"list",by:"6e228580-1cb3-11e8-8271-891867c15336",
uid:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47"}) 
ON CREATE SET c_list.timestamp="15199833288930000" 
MERGE (c_list)-[:BY{timestamp:"15199833288930000"}]->(u) 
CREATE (s:Statement {name:"#apple #orange #fruit", 
text:"apples and oranges are fruits", 
uid:"0b56a800-1dfd-11e8-802e-b5cbdf950c47", timestamp:"15199833288930000"}) 
CREATE (s)-[:BY {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
timestamp:"15199833288930000"}]->(u) 
CREATE (s)-[:IN {user:"6e228580-1cb3-11e8-8271-891867c15336",
timestamp:"15199833288930000"}]->(c_list) 
MERGE (cc_apple:Concept {name:"apple"}) 
ON CREATE SET cc_apple.timestamp="15199833288930000", cc_apple.uid="0b56a801-1dfd-11e8-802e-b5cbdf950c47" 
MERGE (cc_orange:Concept {name:"orange"}) 
ON CREATE SET cc_orange.timestamp="15199833288930000", cc_orange.uid="0b56cf10-1dfd-11e8-802e-b5cbdf950c47" 
MERGE (cc_fruit:Concept {name:"fruit"}) 
ON CREATE SET cc_fruit.timestamp="15199833288930002", cc_fruit.uid="0b56cf13-1dfd-11e8-802e-b5cbdf950c47" 
CREATE (cc_apple)-[:BY {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",timestamp:"15199833288930000",
statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(u) 
CREATE (cc_apple)-[:OF {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",timestamp:"15199833288930000"}]->(s)  
CREATE (cc_apple)-[:AT {user:"6e228580-1cb3-11e8-8271-891867c15336",timestamp:"15199833288930000",
context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(c_list) 
CREATE (cc_apple)-[:TO {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",
timestamp:"15199833288930000",uid:"0b56cf11-1dfd-11e8-802e-b5cbdf950c47",gapscan:"2",weight:"3"}]->(cc_orange) 
CREATE (cc_orange)-[:BY {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",timestamp:"15199833288930000",statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(u) 
CREATE (cc_orange)-[:OF {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",timestamp:"15199833288930000"}]->(s) 
CREATE (cc_orange)-[:AT {user:"6e228580-1cb3-11e8-8271-891867c15336",timestamp:"15199833288930000",
context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(c_list) 
CREATE (cc_orange)-[:TO {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",
timestamp:"15199833288930002",uid:"0b56cf14-1dfd-11e8-802e-b5cbdf950c47",gapscan:"2",weight:"3"}]->(cc_fruit) 
CREATE (cc_apple)-[:TO {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",
timestamp:"15199833288930002",uid:"0b56cf16-1dfd-11e8-802e-b5cbdf950c47",gapscan:"4",weight:"2"}]->(cc_fruit) 
CREATE (cc_fruit)-[:BY {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
timestamp:"15199833288930002",statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(u) 
CREATE (cc_fruit)-[:OF {context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",user:"6e228580-1cb3-11e8-8271-891867c15336",timestamp:"15199833288930002"}]->(s) 
CREATE (cc_fruit)-[:AT {user:"6e228580-1cb3-11e8-8271-891867c15336",
timestamp:"15199833288930002",context:"0b4fa320-1dfd-11e8-802e-b5cbdf950c47",
statement:"0b56a800-1dfd-11e8-802e-b5cbdf950c47"}]->(c_list)  
RETURN s.uid;

This seems very manual. Do you have any means of splitting the text string and handling all tokens at once? Also, where are all these values coming from? It's hard to tell what's meant to be input to your query, what's meant to be auto-generated (uuids?) and what's meant to be hardcoded. You should really use parameters and variables to reduce redundancy if possible. — InverseFalcon
Yes, I can split the text string, but I don't quite understand what you mean by handling all tokens at once. The values are just the unique connections, and the UUIDs are auto-generated IDs so that they don't collide. And what do you mean by using parameters and variables? The query itself is of course generated by a script (node.js), but this is how it looks when it's sent to Neo4J. — Aerodynamika
Here's the docs for parameters. Variables work similarly. I was referring to using parameters and/or variables for the strings and values in the query, so you wouldn't have to repeat the same uuids over and over again, for example, you'd just use the parameterized value. — InverseFalcon
One major thing to highlight: a query that changes (especially drastically in length) as the input changes is a warning that your approach is likely flawed, and that the query needs fixing so you're not explicitly handling each part of the input. Note that stdob--'s query stays constant no matter how large the sentence is. — InverseFalcon

stdob-- · Accepted Answer · 2018-03-02T12:04:11

1) Use the input parameters:

var params = {
    userId: "6e228580-1cb3-11e8-8271-891867c15336",
    contextName: "list",
    time: "15199833288930000",
    statementName: "#apple #orange #fruit",
    statementText: "apples and oranges are fruits",
    concepts: ["apple", "orange", "fruit"],
    conceptsRelations: [
        {from: "apple",  to: "orange", gapscan: 2, weight: 3},
        {from: "orange", to: "fruit",  gapscan: 2, weight: 3},
        {from: "apple",  to: "fruit",  gapscan: 4, weight: 2}
    ]
}
session.run(cypherQuery, params).then...
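
A parameter object like the one above could be assembled from a token list roughly as follows. This is only a sketch: `buildParams` is a hypothetical helper, and the gapscan/weight rule in it is an assumption picked solely to reproduce the three relations listed above, not the scoring the app actually uses.

```javascript
// Hypothetical helper (not part of the original app): builds the query
// parameters for one statement from its list of tokens.
function buildParams(userId, contextName, time, text, tokens) {
  // Unique concept names, in order of first appearance.
  const concepts = [...new Set(tokens)];

  // ASSUMED rule, chosen only to match the example relations:
  // neighbouring words get gapscan 2 / weight 3,
  // words two positions apart get gapscan 4 / weight 2.
  const rules = [
    { distance: 1, gapscan: 2, weight: 3 },
    { distance: 2, gapscan: 4, weight: 2 },
  ];

  const conceptsRelations = [];
  for (const { distance, gapscan, weight } of rules) {
    for (let i = 0; i + distance < tokens.length; i++) {
      const from = tokens[i];
      const to = tokens[i + distance];
      // Skip self-loops when a word repeats.
      if (from !== to) {
        conceptsRelations.push({ from, to, gapscan, weight });
      }
    }
  }

  return {
    userId,
    contextName,
    time,
    statementName: tokens.map((t) => "#" + t).join(" "),
    statementText: text,
    concepts,
    conceptsRelations,
  };
}
```

The point is that only this object grows with the input, while the query text sent to the database stays the same for every statement.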

2) Use the APOC library to generate unique identifiers on the database side: apoc.create.uuid()

3) Use loops (FOREACH and UNWIND) for repetitive operations:

MATCH (u:User {uid: $userId})
MERGE (c_list:Context {name: $contextName, by: $userId})
    ON CREATE SET c_list.timestamp = $time,
                  c_list.uid = apoc.create.uuid()
MERGE (c_list)-[:BY{timestamp: $time}]->(u)

CREATE (s:Statement {name: $statementName, 
                     text: $statementText, uid:apoc.create.uuid(), timestamp: $time})
CREATE (s)-[:BY {context: c_list.uid, timestamp: $time}]->(u)
CREATE (s)-[:IN {user: u.uid, timestamp: $time}]->(c_list)

FOREACH (conceptName in $concepts|
    MERGE (concept:Concept {name: conceptName})
        ON CREATE SET concept.timestamp = $time,
                      concept.uid = apoc.create.uuid()
    CREATE (concept)-[:BY {context: c_list.uid, timestamp: $time, statement: s.uid}]->(u)
    CREATE (concept)-[:OF {context: c_list.uid, user: u.uid, timestamp: $time}]->(s)
    CREATE (concept)-[:AT {user: u.uid, timestamp: $time, 
                           context: c_list.uid, statement: s.uid}]->(c_list)
)

WITH u, c_list, s

UNWIND $conceptsRelations as conceptsRelation
  MATCH (c_from:Concept {name: conceptsRelation.from})
  MATCH (c_to:Concept {name: conceptsRelation.to})
  CREATE (c_from)-[:TO {context: c_list.uid, statement: s.uid, user: u.uid,
                        timestamp: $time, uid: apoc.create.uuid(), 
                        gapscan: conceptsRelation.gapscan, 
                        weight: conceptsRelation.weight}]->(c_to)
RETURN distinct s.uid;
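
For a long text, one way to apply the query above is to split the text into sentences and run the same query once per sentence, each time with a different parameter object. The sketch below assumes `session` is a neo4j-driver session (anything with a compatible `run(query, params)` method works), and `addStatements` is a hypothetical name. It runs the statements one after another rather than concurrently, on the assumption that parallel writes to the same Concept nodes would mostly contend for the same locks.

```javascript
// Sketch only: runs one constant query once per statement.
// `session` is ASSUMED to expose run(query, params) returning a promise
// that resolves to { records: [...] }, as a neo4j-driver session does.
async function addStatements(session, cypherQuery, paramsList) {
  const statementUids = [];
  for (const params of paramsList) {
    // Outside an explicit transaction, each run() call is its own
    // auto-commit transaction, so every statement is committed separately.
    const result = await session.run(cypherQuery, params);

    // The query returns the new statement's uid in its only column.
    // With an empty conceptsRelations list the UNWIND yields no rows,
    // hence the length check.
    if (result.records.length > 0) {
      statementUids.push(result.records[0].get(0));
    }
  }
  return statementUids;
}
```

Each request then stays roughly the size of its sentence plus the fixed query text, instead of growing into one multi-megabyte query.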
