0
votes

Why are these two queries with the only difference being the campaign_id (a foreign key to another table) getting different performance and different EXPLAIN results?

Query 1 - Avg time: 0.21s

SELECT tx_time, campaign_id, tx_amount, tx_status FROM tx WHERE
 campaign_id=6963 ORDER BY tx_time DESC LIMIT 2500;

Query 2 - Avg time: 0.29s

 SELECT tx_time, campaign_id, tx_amount, tx_status FROM tx WHERE
 campaign_id=6946 ORDER BY tx_time DESC LIMIT 2500;

Query 1 vs Query 2 EXPLAIN:

 id  select_type   table   partitions  type    possible_keys     key             key_len   ref   rows    filtered  Extra
 1   SIMPLE        tx      NULL        index   tx_campaign_id    tx_time         4         NULL  85591   2.92      Using where
 1   SIMPLE        tx      NULL        ref     tx_campaign_id    tx_campaign_id  4         const 106312  100       Using index condition; Using filesort

UPDATE: After adding (tx_id,tx_time,campaign_id) and (tx_id,tx_time) indexes and running ANALYZE, Query 1 has improved to 0.15s but Query 2 has slowed to 13s. Updated EXPLAINs:

 id  select_type   table   partitions  type    possible_keys     key             key_len   ref   rows    filtered  Extra
 1   SIMPLE        tx      NULL        index   tx_campaign_id    tx_time         4         NULL  75450   3.31      Using where
 1   SIMPLE        tx      NULL        ref     tx_campaign_id    tx_campaign_id  4         const 117400  100.00    Using index condition; Using filesort

Table tx:

CREATE TABLE tx (
tx_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
tx_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
campaign_id int(10) unsigned NOT NULL,
tx_amount decimal(12,5) unsigned NOT NULL,
tx_geo varchar(2) NOT NULL,
tx_langauge varchar(511) NOT NULL,
tx_ua varchar(511) NOT NULL,
tx_ip varchar(45) NOT NULL,
tx_status tinyint(255) DEFAULT NULL,
PRIMARY KEY (tx_id),
KEY tx_campaign_id (campaign_id),
KEY tx_time (tx_time) USING BTREE,
KEY tx_amount (tx_amount) USING BTREE,
KEY tx_time_campaign_id (tx_id,tx_time,campaign_id) USING BTREE,
KEY tx_id_time (tx_id,tx_time) USING BTREE,
CONSTRAINT campaign_idcampaign_id FOREIGN KEY (campaign_id) REFERENCES campaign (campaign_id) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=10855433 DEFAULT CHARSET=utf8

2
Probably because the statistics tell the query optimizer to use different indexesRobert Kock
@RobertKock is it because of the total possible rows 85,591 vs 106,312? Should I make any changes to optimize it further?ZAB
I'm not a DB expert. I left my comment only because I think the reason is somewhere in that direction. I leave the final answer to somebody with more expertise.Robert Kock
If you are worried about the 0.08s enough, you could alter your query to force the use of an index with an index hit, e.g. SELECT * FROM table1 USE INDEX (col1_index,col2_index) (see documentation).Cᴏʀʏ
The queries show one table; the EXPLAINs show two tables (actually a 'self-join'). Why the discrepancy??Rick James

2 Answers

1
votes

You need INDEX(campaign_id, tx_time) with the columns in that order.

In general, put the = column first, namely campaign_id. In this case, that takes care of the entire WHERE clause, so you can move on to the ORDER BY. Then add all the columns in the ORDER BY, namely tx_time.

Having successfully built an index that handles those, then the processing can stop at the LIMIT rows and avoid a 'filesort'.

Index Cookbook

0
votes

Without seeing your schema, it's hard to be sure, but I'm guessing it's because the optimizer is trying to figure out which index is more useful.

I assume you don't have a compound index (transaction_id, tx_time) on the table; if you had, the optimizer would probably use that (and be faster).

If you think about how the query would work, you can either first find all the records based on transaction ID, and then sort them based on time, or you can sort the records based on time, and discard the ones that don't belong to the transaction id you care about.

The first option (find all the matching transactions, then sort them) is fastest if you have lots of transaction IDs, and not that many time stamps. The second is fastest if you have lots of time stamps, and not that many transaction IDs. That's why the number of rows considered varies between query plans.

The best way to optimize this is to create the compound index, and to make sure you update the statistics the query optimizer uses.