mysql aggregate functions in query with two joins gives unexpected results

Question

Given the following (very simplified) mysql table structure:

products

id

product_categories

id
product_id
status (integer)

product_tags

id
product_id
some_other_numeric_value

I am trying to find every product that has an association to a certain product_tag, and that a relation to at least one category whichs status-attribute is 1.

I tried the following query:

SELECT *

FROM `product` p

JOIN `product_categories` pc
ON p.`product_id` = pc.`product_id`

JOIN `product_tags` pt
ON p.`product_id` = pt.`product_id`

WHERE pt.`some_value` = 'some comparison value'

GROUP BY p.`product_id`

HAVING SUM( pc.`status` ) > 0

ORDER BY SUM( pt.`some_other_numeric_value` ) DESC

Now my problem is: The SUM(pt.some_other_numeric_value) returns unexpected values.

I realized that if the product in question has more then one relation to the product_categories table, then every relation to the product_tags table is counted as many timed as there are relations to the product_categories table!

For example: If product with id=1 has a relation to product_categories with ids = 2, 3 and 4, and a relation with the product_tags with ids 5 and 6 - then if I insert a GROUP_CONCAT(pt.id), then it does give 5,6,5,6,5,6 instead of the expected 5,6.

At first I suspected it was a problem with the join type (left join, right join, inner join, and so on), so I tried every join type that I know of, but to no avail. I also tried to include more id-fields into the GROUP BY clause, but this didn´t solve the problem either.

Can somebody explain to me what is actually going wrong here?

for questions like this, it is much better to add sample records with desired result. — John Woo
yes better give eexemple of your created tables and desired results — echo_Me

ypercubeᵀᴹ ypercubeᵀᴹ · Accepted Answer · 2013-01-03T14:30:43

You join a "main" (product) table to two tables (tags and categories) via 1:n relationships, so this is expected, you are creating a mini cartesian product. For those products that have both more than one associated tags and more than one associated categories, multiple rows are created in the result set. If you Group By, you have wrong results in aggregate functions.

One way to avoid this is to remove one of the two joins, which is a valid startegy if you don't need results from that table. Say you don't need anything in the SELECT list from the product_categories table. Then you can use a semi-join (the EXISTS subquery)to that table:

SELECT p.*,
       SUM( pt.`some_other_numeric_value` )

FROM `product` p

JOIN `product_tags` pt
  ON p.`product_id` = pt.`product_id`

WHERE pt.`some_value` = 'some comparison value'

  AND EXISTS
      ( SELECT *
        FROM product_categories pc
        WHERE pc.product_id = pc.product_id
         AND  pc.status = 1
      ) 

GROUP BY p.`product_id`

ORDER BY SUM( pt.`some_other_numeric_value` ) DESC ;

Another way to circumvent this problem is - after the GROUP BY MainTable.pk - to use DISTINCT inside the COUNT() or GROUP_CONCAT() aggregate functions. This works but you can't use it with SUM(). So, it's not useful in your specific query.

A third option - which works always - is to first group by the two (or more) side tables and then join to the main table. Something like this in your case:

SELECT p.* ,
       COALESCE(pt.sum_other_values, 0) AS sum_other_values
       COALESCE(pt.cnt, 0) AS tags_count,
       COALESCE(pc.cnt, 0) AS categories_count,
       COALESCE(category_titles, '') AS category_titles

FROM `product` p

JOIN 
    ( SELECT product_id
           , COUNT(*) AS cnt
           , GROUP_CONCAT(title) AS category_titles
      FROM `product_categories` pc
      WHERE status = 1
      GROUP BY product_id
    ) AS pc
  ON p.`product_id` = pc.`product_id`

JOIN 
    ( SELECT product_id
           , COUNT(*) AS cnt
           , SUM(some_other_numeric_value) AS sum_other_values
      FROM `product_tags` pt
      WHERE some_value = 'some comparison value'
      GROUP BY product_id
    ) AS pt
ON p.`product_id` = pt.`product_id`

ORDER BY sum_other_values DESC ;

The COALESCE() are not strictly needed there - just in case you chnage the inner joins to LEFT outer joins.

mysql aggregate functions in query with two joins gives unexpected results

2 Answers