Neo4j indexing for large number of nodes

Question

I am learning the basics of neo4j and I am looking at the following example with credit card fraud https://linkurio.us/stolen-credit-cards-and-fraud-detection-with-neo4j. Cypher query that finds stores where all compromised user shopped is

MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE r.status = “Disputed”
MATCH victim-[t:HAS_BOUGHT_AT]->(othermerchants)
WHERE t.status = “Undisputed” AND t.time < r.time
WITH victim, othermerchants, t ORDER BY t.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT t) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC

However, when the number of users increase (let's say to millions of users), this query may become slow since the initial query will have to traverse through all nodes labeled person. Is it possible to speed up the query by asigning properties to nodes instead of transactions? I tried to remove "status" property from relationships and add it to nodes (users, not merchants). However, when I run query with constraint WHERE victim.status="Disputed" query doesn't return anything. So, in my case person has one additional property 'status'. I assume I did a lot of things wrong, but would appreciate comments. For example

 MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
 WHERE victim.status = “Disputed”

returns the correct number of disputed transactions. The same holds for separately quering number of undisputed transactions. However, when merged, they yield an empty set.

If I made a mistake in my approach, how can I speed up queries for large number of nodes (avoid traversing all nodes in the first step). I will be working with a data set with similar properties, but will have around 100 million users, so I would like to index users on additional properties.

cybersam cybersam · Accepted Answer · 2014-11-18T01:34:38

[Edited]

Moving the status property from the relationship to the person node does not seem to be the right approach, since I presume the same person can be a customer of multiple merchants.

Instead, you can reify the relationship as a node (let's label it purchase), as in:

(:person)-[:HAS_PURCHASE]->(:purchase)-[:BOUGHT_AT]->(merchant)

The purchase nodes can have the status property. You just have to create the index:

CREATE INDEX ON :purchase(status)

Also, you can put the time property in the new purchase nodes.

With the above, your query would become:

MATCH (victim:person)-[:HAS_PURCHASE]->(pd:purchase)-[:BOUGHT_AT]->(merchant)
WHERE pd.status = “Disputed”
MATCH victim-[:HAS_PURCHASE]->(pu:purchase)-[:BOUGHT_AT]->(othermerchants)
WHERE pu.status = “Undisputed” AND pu.time < pd.time
WITH victim, othermerchants, pu ORDER BY pu.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT pu) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DES

Neo4j indexing for large number of nodes

1 Answers