Performing aggregation across unique nodes with multiple relationships or multiple labels

Question

Any suggestions to improve the performance by either improving the graph structure or query for the following? Ideally, I'd like to get this to sub 1s. Right now the best I can get it ~8s with 7M db hits over about 2M nodes and 10M rels.

I've got graph structure where

(co:Company)<-[:HAS_PROVIDER]-(c:Customer)-[:HAS_CLAIM]->(c:Claim)

each claim has a boolean property of "direct" as well as a property called "amount"
each company has a name
Customers can have multiple Claims but only one Company

I want to be able to the show:

distinct company name
customer count for company
count of direct=true claims
sum of amount for direct=true claims
count of direct=false claims
sum of amount for direct=false claims

To achieve this, I've used two approaches:

a) created a relationship from Customer to Claim where direct=true as :IS_DIRECT and direct=false as :IS_INDIRECT

b) labeled each direct=true claim as a :DirectClaim node and direct=false claim as :InDirectClaim node

using (a) allows me to get company name, customer count, size of (IS_DIRECT) and size (IS_INDIRECT) by filter on rels TYPE. however, getting the sum(amount) using combinations of extract, filter, reduce times out no matter the configuration.

using (b) works but takes ~10s

EDIT:

query for (a) looks like (with the hattip to @cybersam this is now ~6s)

MATCH (co:Company)<-[:HAS_PROVIDER]-(c)-[r:IS_DIRECT|IS_INDIRECT]->(cl)
WITH distinct co, collect(r) as rels, count (distinct c) as cntc, collect(mc) as claims
WITH co, cntc, 
size(filter(r in rels WHERE TYPE(r) = 'IS_DIRECT')) as dcls, 
size(filter(r in rels WHERE TYPE(r) = 'IS_INDIRECT')) as indcls,
REDUCE
  (s = {dclsamt: 0, indclsamt: 0}, x IN claims | 
  CASE WHEN x.direct 
  THEN {dclsamt: s.dclsamt + x.amount, indclsamt: s.indclsamt} 
  ELSE {dclsamt: s.dclsamt, indclsamt: s.indclsamt + x.amount} 
  END) 
  AS data
RETURN co.name  as name,cntc, dcls,indcls, data
ORDER BY dcls desc

query for (b) looks like:

MATCH (co:Company)<-[:HAS_PROVIDER]-(c)-[:HAS_CLAIM]->(cl)
WITH distinct co, count (distinct c) as cntc, COLLECT(cl) as cls
WITH co,cntc, 
FILTER(n in cls WHERE 'DirectClaim' IN IN LABELS(n)) as dcls,
FILTER(n in cls WHERE 'InDirectClaim' IN LABELS(n)) as indcls
WITH co,cntc, size(dcls) as dclsct, size(indcls) as indclsct, 
REDUCE(s = 0 , x IN dcls | s + x.amount) as dclsamt, 
REDUCE(s = 0 , x IN indcls | s + x.amount) as indclsamt
RETURN co.name as name, cntc, dclsct, dclsamt, indclsct, indclsamt

Can you provide the queries you're currently using for both approach a and b? — InverseFalcon

cybersam cybersam · Accepted Answer · 2016-10-17T22:40:40

There is no need to add extra data (like redundant relationships or labels) to your data model.

This query shows one way to return the results you want (in the returned data map, t is the sum of the true amounts, and tc is the count of the true amounts; similarly for f and fc):

MATCH (co:Company)<-[:HAS_PROVIDER]-(cu:Customer)-[:HAS_CLAIM]->(cl:Claim)
WITH co.name as comp_name, COUNT(DISTINCT cu) AS cust_count,
  REDUCE(s = {t: 0, tc: 0, f: 0, fc: 0}, x IN COLLECT(cl) |
    CASE WHEN x.direct
      THEN {t: s.t + x.amount, tc: s.tc + 1, f: s.f, fc: s.fc}
      ELSE {t: s.t, tc: s.tc, f: s.f + x.amount, fc: s.fc + 1}
    END) AS data
RETURN comp_name, cust_count, data

Performing aggregation across unique nodes with multiple relationships or multiple labels

2 Answers