How can I use neo4j cypher query to create a histogram of nodes bucketed by number of relationships?

3

votes

I have a big bunch of nodes which match the following cypher:

(:Word)<-[:Searched]-(:Session)

I want to make a histogram of the number of Word nodes at each frequency of Searched relationships.

I want to make this sort of chart:

Searches Words
0        100
1-5      200
6-10     150
11-15    50
16-20    25

I'm just starting out with neo4j, and I'm not sure how to approach this or even if there is a way to specify this in cypher. The closest I've got is to count the relationships and get averages.

MATCH (n:Word) 
RETURN
DISTINCT labels(n),
count(*) AS NumofNodes,
avg(size((n)<-[:Searched]-())) AS AvgNumOfRelationships,
min(size((n)<-[:Searched]-())) AS MinNumOfRelationships,
max(size((n)<-[:Searched]-())) AS MaxNumOfRelationships

That is based on an example here: https://neo4j.com/developer/kb/how-do-i-produce-an-inventory-of-statistics-on-nodes-relationships-properties/

I've also seen use of the modulus operator for grouping to get buckets, though I'm not sure how to use that in reference to the count: Neo4j cypher time interval histogram query of time tree

Is there a "best" way to do this?

neo4jstatisticscypherhistogram

2

votes

The following should work:

WITH 5 AS gSize
MATCH (w:Word)
OPTIONAL MATCH (w)<-[s:Searched]-()
WITH gSize, w, TOINT((COUNT(s) + (gSize-1))/gSize * gSize) AS m
RETURN
  CASE m WHEN 0 THEN '0' ELSE (m-gSize+1)+'-'+m END AS range,
  COUNT(*) AS ct
ORDER BY range;

With the sample data provided by @GaborSzarnyas, the output is:

+-------------+
| range  | ct |
+-------------+
| "0"    | 1  |
| "1-5"  | 1  |
| "6-10" | 1  |
+-------------+

1

votes

I was able to figure out a query which I think gets me the data I want:

MATCH (n:Word) 
WITH n, 5 AS bucketsize
WITH (FLOOR(SIZE( (n)<-[:Searched]-() ) / bucketsize) * bucketsize) AS numRels
RETURN numRels, COUNT(*)
ORDER BY numRels ASC

It doesn't get the zero row, which I'd like to have, but it seems like it works otherwise. Hopefully someone else has a better solution.

1

votes

I created a simple example dataset of three words: w1 with no searches, w2 with 3 searches and w3 with 6.

CREATE (w1:Word {w: '1'})
WITH count(*) AS dummy

CREATE (w2:Word {w: '2'}) WITH w2
UNWIND range(1, 3) AS i
CREATE (w2)<-[:Searched]-(:Session)
WITH count(*) AS dummy

CREATE (w3:Word {w: '3'}) WITH w3
UNWIND range(1, 6) AS i
CREATE (w3)<-[:Searched]-(:Session)

I would approach it like this: first, let's create a list with the upper limits for each bucket:

RETURN [i IN range(0, 4) | i*5] AS upperLimits

╒══════════════╕
│"upperLimits" │
╞══════════════╡
│[0,5,10,15,20]│
└──────────────┘

Second, use this with a list comprehension that selects the elements from the list that has a sufficiently large upper limit. The first one of these is our bucket, so we select that with the [0] list indexer. The rest is just calculating the lower limit and ordering rows:

WITH [i IN range(0, 4) | i*5] AS upperLimits
MATCH (n:Word) 
WITH upperLimits, ID(n) AS n, size((n)<-[:Searched]-()) AS numOfRelationships
WITH
  [upperLimit IN upperLimits WHERE numOfRelationships <= upperLimit][0] AS upperLimit,
  count(n) AS count
RETURN
  upperLimit - 4 AS lowerLimit,
  upperLimit,
  count
ORDER BY lowerLimit

The query gives the following results:

╒════════════╤════════════╤═══════╕
│"lowerLimit"│"upperLimit"│"count"│
╞════════════╪════════════╪═══════╡
│-4          │0           │1      │
├────────────┼────────────┼───────┤
│1           │5           │1      │
├────────────┼────────────┼───────┤
│6           │10          │1      │
└────────────┴────────────┴───────┘

Potential improvements:

(1) If the value of numOfRelationships is larger than the largest upper limit, the query above will return the first element of an empty list, which is null. To avoid that, either 1) set a sufficiently large upper limit, e.g.

MATCH (n:Word) 
WITH max(size((n)<-[:Searched]-())) AS maxNumberOfRelationShips
WITH [i IN range(-1, maxNumberOfRelationShips/5+1) | {lower: i*5-4, upper: i*5}] AS limits
RETURN *

You can use the top bucket with "16 or larger" semantics with coalesce.

(2) -4 as a lower limit is not very nice, we can use CASE to get rid of it.

Putting all this together, we get this:

MATCH (n:Word) 
WITH max(size((n)<-[:Searched]-())) AS maxNumberOfRelationShips
WITH [i IN range(0, maxNumberOfRelationShips/5+1) | i*5] AS upperLimits
MATCH (n:Word) 
WITH upperLimits, ID(n) AS n, size((n)<-[:Searched]-()) AS numOfRelationships
WITH
  [upperLimit IN upperLimits WHERE numOfRelationships <= upperLimit][0] AS upperLimit,
  count(n) AS count
RETURN 
  CASE WHEN upperLimit - 4 < 0 THEN 0 ELSE upperLimit - 4 END AS lowerLimit,
  upperLimit,
  count
ORDER BY lowerLimit

Which results in:

╒════════════╤════════════╤═══════╕
│"lowerLimit"│"upperLimit"│"count"│
╞════════════╪════════════╪═══════╡
│0           │0           │1      │
├────────────┼────────────┼───────┤
│1           │5           │1      │
├────────────┼────────────┼───────┤
│6           │10          │1      │
└────────────┴────────────┴───────┘

1

votes

What I usually do in this scenario is that I use the setting in neo4j that if you divide an integer by an integer you get back an integer. This simplifies the query alot. We add a special case for 0 and it all fits in one line.

WITH [0,1,5,7,9,11] as list
UNWIND list as x
WITH CASE WHEN x = 0 THEN -1 ELSE  (x / 5) * 5 END as results
return results

This returns

-1, 0, 5, 5, 5, 10

Which is not ideal given that you want to group 1-5 together but good enough i guess.

How can I use neo4j cypher query to create a histogram of nodes bucketed by number of relationships?

4 Answers