weighted sums in neo4j queries

Question

Let's say you have a neo4j graph that has 100,000 "color" nodes and 50,000 "painting" nodes. Each painting node has a "contains" relationship with 50 to 100 of the colors. Lets also say you 200 "aggregate color" nodes each with a relationship with ~ 1000 colors. The aggregate color nodes contain a scalar weight. Finally, you create a "palette" node with a relationship with 10 - 20 aggregate colors.

I want a node4j cypher query that identifies the top 10 painting with the highest weighted sum of aggregated colors based on the colors in the painting.

Let

c represent a color node
a represent a aggregate color node
p represent a painting
l represent a palette

So

(p)-[:contains]->(c)
(a)-[:aggregates]->(c)
(l)-[:uses]->(a)

Supposing I have a palette call "MY_PALETTE", this query will tell me the top 10 paintings in terms of the number of matches of unique aggregated colors.

MATCH (l)-[:uses]->(a)-[:contains]->(c) WHERE l.name = 'MY_PALETTE'
WITH a MATCH (p)-[:contains]->(c), (a)-[:aggregates]->(c)
WITH p, a RETURN p.name, COUNT(DISTINCT a) ORDER BY COUNT(DISTINCT a)
DESC LIMIT 10;

I want the top paintings in terms of the weighted sum. If all the weights were 1, this would give the correct answer.

It seems I can't inspect a in the RETURN clause.

Note that I want to count each aggregated color only once even if the painting contains several colors in the aggregated color.

I want to be able to add new palettes and only have to add relationships between palette and aggregated color.

Any suggestions?

If a painting uses, say, 3 colors from the same aggregate, do you want to use the aggregate's weight once or 3 times in the sum? — cybersam

cybersam cybersam · Accepted Answer · 2018-03-21T22:43:05

Your query has a couple of bugs. The first MATCH should be looking for an aggregates relationship (not a contains relationship) between a and c. And the first WITH clause is not passing the c value forward.

Here is a simplified and more efficient form of that query:

MATCH (l:Palette)-[:uses]->(a)-[:aggregates]->()<-[:contains]-(p)
WHERE l.name = 'MY_PALETTE'
WITH p, COUNT(DISTINCT a) AS aggregate_count
RETURN p.name AS palette_name, aggregate_count
ORDER BY aggregate_count DESC
LIMIT 10;

To get the top 10 paintings by total aggregate weight (using the weight of an aggregate at most once, even if multiple colors in that aggregate are used by a painting), you can do this:

MATCH (l:Palette)-[:uses]->(a)-[:aggregates]->()<-[:contains]-(p)
WHERE l.name = 'MY_PALETTE'
WITH DISTINCT p, a
RETURN p.name AS palette_name, SUM(a.weight) AS aggregate_weight_total
ORDER BY aggregate_weight_total DESC
LIMIT 10;

Note that if there are multiple paintings with the same name (e.g., "Still Life"), the above queries would return the same palette_name multiple times. To make that possibility less confusing, you may also want to return a unique property value (e.g., p.id) for each painting as well.

Also, you should consider creating an index (or a uniqueness constraint) on :Palette(name) to speed up the lookup of palettes by name.

weighted sums in neo4j queries

1 Answers