I have Neo4j 3.3.5 graph database: 27GB, 50kk nodes, 500kk relations. Indexes on. Schema. PC: 16GB ram, 4 cores.
Task is to find best matching companies for given query data. Nodes :Company, which I need to get, have multiple relations with nodes:Branch's, :Country's, etc. Query data has BranchIds, CountryIds, etc.
Currently I am using cypher like this to get score from one relations (500k rows in result):
MATCH (c:Company)-[r:HAS_BRANCH]->(b:Branch)
WHERE b.branchId in [27444, 1692, 23409, ...] //around 10 ids per query
RETURN
c.companyId as Id,
case r.branchType
when 0 then 25
... // //around 7 conditions per query
when 10 then 20
end as Score
I have to score like this all relations types for :Company, group by Id
, sum Score
, order and take top 100 results.
Because of lack of post union processing, I am using collect
+ unwind
in order to merge scores from all relations.
Unfortunately, the performance is low. I get response for query of one relations (like above) in 5-10 seconds. When I am trying to combine results with collect
+ unwind
, the query "never" ends.
What is better/proper way to do it? Maybe I am doing something wrong with graph design? Hardware configuration to low? Or maybe there are some algorithms to match with score graph (query data) in graph database?
UPDATE
Query explanation:
User can search in our system for companies. For his query we prepare query data contains ids of branches, countries, words, etc. In result of query we want to get list of best matching companies ids with score.
E.g. user can search for new companies producing wooden tables from Spain.
Combined query example:
MATCH (c:Company)-[r:HAS_BRANCH]->(b:Branch)
WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"]
WITH case r.branchType
when "0" then collect({id:c.companyId, score: 25})
when "1" then collect({id:c.companyId, score: 19})
when "2" then collect({id:c.companyId, score: 20})
when "3" then collect({id:c.companyId, score: 19})
when "4" then collect({id:c.companyId, score: 20})
when "5" then collect({id:c.companyId, score: 15})
when "6" then collect({id:c.companyId, score: 6})
when "7" then collect({id:c.companyId, score: 5})
when "8" then collect({id:c.companyId, score: 4})
when "9" then collect({id:c.companyId, score: 4})
when "10" then collect({id:c.companyId, score: 20})
end as rows
MATCH (c:Company)-[r:HAS_REVERTED_BRANCH]->(b:Branch)
WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"]
WITH rows + case r.branchType
when "0" then collect({id:c.companyId, score: 25})
when "1" then collect({id:c.companyId, score: 19})
when "2" then collect({id:c.companyId, score: 20})
when "3" then collect({id:c.companyId, score: 19})
when "10" then collect({id:c.companyId, score: 20})
end as rows
MATCH (c:Company)-[r:HAS_COUNTRY]->(cou:Country)
WHERE cou.countryId in ["9580" , "18551" , "15895"]
WITH rows + case r.branchType
when "0" then collect({id:c.companyId, score: 30})
when "2" then collect({id:c.companyId, score: 15})
end as rows
... //here I would add in future other relations scoring
UNWIND rows AS row
RETURN row.id AS Id, sum(row.score) AS Score
ORDER BY Score DESC
LIMIT 100
HAS_BRANCH
relationship is important for your business, and generally if it's important, it should be a node. – logisima