2
votes

I have two tables. In the table UnRevisedTable goes the raw data and in the table RevisedTable there is only entries of revised values which a human inserts when he finds a value in UnRevisedTable than he doesnt likes.

In UnRevisedTable there is one entry every minute and in RevisedTable there is one entry for every unwanted entry in UnRevisedTable.

The purpose of those two queries is the same: Show me the union of revised and unrevised data replacing the unrevised data when there is revised data.

I started writing the query2 but it was too slow. Then i wrote the query1 and the query1 is much, much faster.

My question is why query1 is faster than query2? Thx.

query1:
    SELECT o.start_date_time,
        CASE WHEN r.start_date_time IS NULL THEN o.value ELSE r.value END AS value,
        FROM UnRevisedTable o LEFT JOIN RevisedTable r ON o.start_date_time = r.start_date_time
        WHERE o.start_date_time >= '".$start."' AND o.start_date_time < '".$finish."' ORDER BY start_date_time ASC;

query2:
    select * from(
      select RevisedTable.* from RevisedTable where start_date_time between '".$start."' and '".$finish."' 
          union 
      select UnRevisedTable.* from UnRevisedTable where start_date_time between '".$start."' and '".$finish."'
    ) as t1 group by start_date_time;

As Abhik Chakraborty sugested here are the explain query1 and explain query2:

query1:

+----+-------------+-------+--------+---------------+---------+---------+------------------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                    | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+------------------------+------+-------------+
|  1 | SIMPLE      | o     | range  | PRIMARY       | PRIMARY | 8       | NULL                   |    9 | Using where |
|  1 | SIMPLE      | r     | eq_ref | PRIMARY       | PRIMARY | 8       | Mydb.o.start_date_time |    1 |             |
+----+-------------+-------+--------+---------------+---------+---------+------------------------+------+-------------+

query2:

+----+--------------+------------+-------+---------------+---------+---------+------+------+---------------------------------+    
| id | select_type  | table      | type  | possible_keys | key     | key_len | ref  | rows | Extra                           |
+----+--------------+------------+-------+---------------+---------+---------+------+------+---------------------------------+
|  1 | PRIMARY      | <derived2> | ALL   | NULL          | NULL    | NULL    | NULL |   14 | Using temporary; Using filesort |
|  2 | DERIVED      | RevisedTable  | range | PRIMARY    | PRIMARY | 8       | NULL |    2 | Using where                     |
|  3 | UNION        | UnRevisedTable| range | PRIMARY    | PRIMARY | 8       | NULL |   10 | Using where                     |
| NULL | UNION RESULT | <union2,3> | ALL   | NULL        | NULL    | NULL    | NULL | NULL |                                 |
+----+--------------+------------+-------+---------------+---------+---------+------+------+---------------------------------+
2
use explain first query and then explain 2nd query and share the result in the question. - Abhik Chakraborty

2 Answers

2
votes

First, the two queries do not do the same thing.

  • The first query returns only times from UnrevisedTable. If there are times in the RevisedTable not in Unrevised, you won't get them.
  • The first query is using a join. And, if you have an index on the join key, start_date_time, then the index will be used. Aggregation is less likely to use indexes.
  • The first query will not remove duplicates that might exist within each table. The second does.
  • The two queries will not necessarily return the same value when there is a match in the two tables.

The second query is doing two aggregations, but no joins. The first aggregation is for removing duplicate values (for the union) and the second for the outer group by.

I would start by revising the second query to be:

select start_date_time, max(value) as value
from ((select start_date_time, value
       from RevisedTable
       where start_date_time between '".$start."' and '".$finish."' 
      ) union all
      (select start_date_time, value
       from UnRevisedTable
       where start_date_time between '".$start."' and '".$finish."'
      )
     ) t1
 group by start_date_time

You will probably also find that this is slower than the join. The MySQL engine does a better job of implementing joins than aggregations.

1
votes

To answer you question the slower query uses a UNION, which removes rows which are duplicated between the first and second tables. This will require a sort which is often expensive. You can see that in query 2's plan as a filesort. You can eliminate this sort by using a UNION ALL instead.

You may like to consider replacing the CASE statement in query 1 with COALESCE which returns the first non-null. It will make for an easier-read query and may run faster.

SELECT 
    o.start_date_time,
    COALESCE (o.value, r.value) AS value
FROM UnRevisedTable o LEFT JOIN RevisedTable r ON o.start_date_time = r.start_date_time
WHERE o.start_date_time >= '".$start."' 
AND o.start_date_time < '".$finish."' 
ORDER BY start_date_time ASC;