1
votes

Background

I want to create a histogram of the relationships starting from a set of nodes. Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ]. The output is a the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):

id_0:
- ACTED_IN: 14
- DIRECTED: 1

id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2

id_2:
...

The current cypher query I've written is:

MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] # set
RETURN id(n) as id, type(r) as type, count(r) as count

It returns the pair of [ id, type ] count like:

id   |  rel type  | count
id0  |  ACTED_IN  |    14
id0  |  DIRECTED  |     1
id1  |  DIRECTED  |    12
id1  |  WROTE     |     5
id1  |  ACTED_IN  |     2
...

The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).

Problem

Getting the relationship histogram on smaller graphs is fast but can be very slow on bigger datasets. For example if I want to create the histogram where the set-size is about 100 ids/nodes and each of those nodes have around 1000 relationships the cypher query took about 5 minutes to execute.

  1. Is there more efficient way to collect the histogram for a set of nodes?
  2. Could this query be parallelized? (With java code or using UNION?)
  3. Is something wrong with how I set up my neo4j database, should these queries be this slow?
1

1 Answers

4
votes

There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.

Bit of background :

Using count, will execute an expandAll, which is as expensive as the number of relationships a node has

PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)

enter image description here

Using size and a relationship type, uses internally getDegree which is a statistic a node has locally, and thus is very efficient

PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())

enter image description here

Morale of the story, for using size you need to know the relationship types a labeled node can have. So, you need to know the schema of the database ( in general you will want that, it makes things easily predictable and building dynamically efficient queries becomes a joy).

But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.

The flow is :

  1. Get all the relationship types from the database ( fast )
  2. Get the nodes from id list ( fast )
  3. Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count

enter image description here