0
votes

Any suggestions to improve the performance by either improving the graph structure or query for the following? Ideally, I'd like to get this to sub 1s. Right now the best I can get it ~8s with 7M db hits over about 2M nodes and 10M rels.

I've got graph structure where

(co:Company)<-[:HAS_PROVIDER]-(c:Customer)-[:HAS_CLAIM]->(c:Claim)

  • each claim has a boolean property of "direct" as well as a property called "amount"
  • each company has a name
  • Customers can have multiple Claims but only one Company

I want to be able to the show:

  • distinct company name
  • customer count for company
  • count of direct=true claims
  • sum of amount for direct=true claims
  • count of direct=false claims
  • sum of amount for direct=false claims

To achieve this, I've used two approaches:

a) created a relationship from Customer to Claim where direct=true as :IS_DIRECT and direct=false as :IS_INDIRECT

b) labeled each direct=true claim as a :DirectClaim node and direct=false claim as :InDirectClaim node

using (a) allows me to get company name, customer count, size of (IS_DIRECT) and size (IS_INDIRECT) by filter on rels TYPE. however, getting the sum(amount) using combinations of extract, filter, reduce times out no matter the configuration.

using (b) works but takes ~10s

EDIT:

query for (a) looks like (with the hattip to @cybersam this is now ~6s)

MATCH (co:Company)<-[:HAS_PROVIDER]-(c)-[r:IS_DIRECT|IS_INDIRECT]->(cl)
WITH distinct co, collect(r) as rels, count (distinct c) as cntc, collect(mc) as claims
WITH co, cntc, 
size(filter(r in rels WHERE TYPE(r) = 'IS_DIRECT')) as dcls, 
size(filter(r in rels WHERE TYPE(r) = 'IS_INDIRECT')) as indcls,
REDUCE
  (s = {dclsamt: 0, indclsamt: 0}, x IN claims | 
  CASE WHEN x.direct 
  THEN {dclsamt: s.dclsamt + x.amount, indclsamt: s.indclsamt} 
  ELSE {dclsamt: s.dclsamt, indclsamt: s.indclsamt + x.amount} 
  END) 
  AS data
RETURN co.name  as name,cntc, dcls,indcls, data
ORDER BY dcls desc

query for (b) looks like:

MATCH (co:Company)<-[:HAS_PROVIDER]-(c)-[:HAS_CLAIM]->(cl)
WITH distinct co, count (distinct c) as cntc, COLLECT(cl) as cls
WITH co,cntc, 
FILTER(n in cls WHERE 'DirectClaim' IN IN LABELS(n)) as dcls,
FILTER(n in cls WHERE 'InDirectClaim' IN LABELS(n)) as indcls
WITH co,cntc, size(dcls) as dclsct, size(indcls) as indclsct, 
REDUCE(s = 0 , x IN dcls | s + x.amount) as dclsamt, 
REDUCE(s = 0 , x IN indcls | s + x.amount) as indclsamt
RETURN co.name as name, cntc, dclsct, dclsamt, indclsct, indclsamt
2
Can you provide the queries you're currently using for both approach a and b?InverseFalcon
@InverseFalcon added queriesgmjordan

2 Answers

2
votes

There is no need to add extra data (like redundant relationships or labels) to your data model.

This query shows one way to return the results you want (in the returned data map, t is the sum of the true amounts, and tc is the count of the true amounts; similarly for f and fc):

MATCH (co:Company)<-[:HAS_PROVIDER]-(cu:Customer)-[:HAS_CLAIM]->(cl:Claim)
WITH co.name as comp_name, COUNT(DISTINCT cu) AS cust_count,
  REDUCE(s = {t: 0, tc: 0, f: 0, fc: 0}, x IN COLLECT(cl) |
    CASE WHEN x.direct
      THEN {t: s.t + x.amount, tc: s.tc + 1, f: s.f, fc: s.fc}
      ELSE {t: s.t, tc: s.tc, f: s.f + x.amount, fc: s.fc + 1}
    END) AS data
RETURN comp_name, cust_count, data
1
votes

Your current data model is more than sufficient to generate the results you want.

MATCH (co:Company)<-[:HAS_PROVIDER]-(cust:Customer)-[:HAS_CLAIM]->(claim:Claim)
WITH co, COUNT(DISTINCT cust) AS custs, COLLECT(DISTINCT claim) AS claims
WITH co,
custs,
[x IN claims WHERE x.direct|x.amount] AS direct_amts,
[x IN claims WHERE NOT x.direct|x.amount] AS indirect_amts
RETURN co,
custs,
SIZE(direct_amts) AS direct_count,
REDUCE(s=0, x IN direct_amts| s+x) AS direct_amt_total,
SIZE(indirect_amts) AS indirect_count,
REDUCE(s=0, x IN indirect_amts| s+x) AS indirect_amt_total

If you really need speed, make sure you have an index on :Claim(direct) and :Claim(amount) and this will really scream. Alternately, turn the boolean property into a second label (i.e., (claim:Claim:Direct))and you can save yourself an index.

UPDATE: Based on what you're citing, your only real avenues for improvement are going to be based on your usage profile. If this graph is "living" and will be continuously updated, you can always cache the count and amount total on the :Customer node whenever you add, remove, or alter a :Claim. That's where graphs shine, small frequent queries that touch a subset of the database. So when you do anything to a claim, rerun your aggregation for just the corresponding :Customer, store the results on the Customer as properties, and then for your big report, just grab these properties directly from the (far fewer in number) :Customer nodes.