Why can't MySQL optimize this query?

Question

I have a query that is giving me problems and I can't understand why MySQL's query optimizer is behaving the way it is. Here is the background info:

I have 3 tables. Two are relatively small and one is large.

Table 1 (very small, 727 rows):

CREATE TABLE ipa (
ipa_id int(11) NOT NULL AUTO_INCREMENT,
ipa_code int(11) DEFAULT NULL,
ipa_name varchar(100) DEFAULT NULL,
payorcode varchar(2) DEFAULT NULL,
compid int(11) DEFAULT '2'
PRIMARY KEY (ipa_id),
KEY ipa_code (ipa_code) ) ENGINE=MyISAM

Table 2 (smallish, 59455 rows):

CREATE TABLE assign_ipa (
assignid int(11) NOT NULL AUTO_INCREMENT,
ipa_id int(11) NOT NULL,
userid int(11) NOT NULL,
username varchar(20) DEFAULT NULL,
compid int(11) DEFAULT NULL,
PayorCode char(10) DEFAULT NULL
PRIMARY KEY (assignid),
UNIQUE KEY assignid (assignid,ipa_id),
KEY ipa_id (ipa_id)
) ENGINE=MyISAM

Table 3 (large, 24,711,730 rows):

CREATE TABLE master_final (
IPA int(11) DEFAULT NULL,
MbrCt smallint(6) DEFAULT '0',
PayorCode varchar(4) DEFAULT 'WC',
KEY idx_IPA (IPA)
) ENGINE=MyISAM DEFAULT

Now for the query. I'm doing a 3-way join using the first two smaller tables to essentially subset the big table on one of it's indexed values. Basically, I get a list of IDs for a user, SJOnes and query the big file for those IDs.

mysql> explain
SELECT master_final.PayorCode, sum(master_final.Mbrct) AS MbrCt
FROM master_final
INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
WHERE assign_ipa.username = 'SJones'
GROUP BY master_final.PayorCode, master_final.ipa\G;
************* 1. row *************
id: 1
select_type: SIMPLE
table: master_final
type: ALL
possible_keys: idx_IPA
key: NULL
key_len: NULL
ref: NULL
rows: 24711730
Extra: Using temporary; Using filesort
************* 2. row *************
id: 1
select_type: SIMPLE
table: ipa
type: ref
possible_keys: PRIMARY,ipa_code
key: ipa_code
key_len: 5
ref: wc_test.master_final.IPA
rows: 1
Extra: Using where
************* 3. row *************
id: 1
select_type: SIMPLE
table: assign_ipa
type: ref
possible_keys: ipa_id
key: ipa_id
key_len: 4
ref: wc_test.ipa.ipa_id
rows: 37
Extra: Using where
3 rows in set (0.00 sec)

This query takes forever (like 30 minutes!). The explain statement tells me why, it's doing a full table scan on the big table even though there is a perfectly good index. It's not using it. I don't understand this. I can look at the query and see that it's only needs to query a couple of IDs from the big table. If I can do it, why can't MySQL's optimizer do it?

To illustrate, here are the IDs associated with 'SJones':

mysql> select username, ipa_id from assign_ipa where username='SJones';
+----------+--------+
| username | ipa_id |
+----------+--------+
| SJones | 688 |
| SJones | 689 |
+----------+--------+
2 rows in set (0.02 sec)

Now, I can rewrite the query substituting the ipa_id values for the username in the where clause. To me this is equivalent to the original query. MySQL sees it differently. If I do this, the optimizer makes use of the index on the big table.

mysql> explain
SELECT master_final.PayorCode, sum(master_final.Mbrct) AS MbrCt
FROM master_final
INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
*WHERE assign_ipa.ipa_id in ('688','689')*
GROUP BY master_final.PayorCode, master_final.ipa\G;
************* 1. row *************
id: 1
select_type: SIMPLE
table: ipa
type: range
possible_keys: PRIMARY,ipa_code
key: PRIMARY
key_len: 4
ref: NULL
rows: 2
Extra: Using where; Using temporary; Using filesort
************* 2. row *************
id: 1
select_type: SIMPLE
table: assign_ipa
type: ref
possible_keys: ipa_id
key: ipa_id
key_len: 4
ref: wc_test.ipa.ipa_id
rows: 37
Extra: Using where
************* 3. row *************
id: 1
select_type: SIMPLE
table: master_final
type: ref
possible_keys: idx_IPA
key: idx_IPA
key_len: 5
ref: wc_test.ipa.ipa_code
rows: 34953
Extra: Using where
3 rows in set (0.00 sec)

The only thing I've changed is a where clause that doesn't even directly hit the big table. And yet, the optimizer uses the index 'idx_IPA' on the big table and the full table scan is no longer used. The query when re-written like this is very fast.

OK, that's a lot of background. Now my question. Why should the where clause matter to the optimizer? Either where clause will return the same result set from the smaller table, and yet I'm getting dramatically different results depending on which one I use. Obviously, I want to use the where clause containing the username rather than trying to pass all associated IDs to the query. As written though, this is not possible?

Can someone explain why this is happening?
How might I rewrite my query to avoid the full table scan?

Thanks for sticking with me. I know its a very longish question.

I read an article by one of the MySQL developers (some time ago) that the optimizer was still a work in progress - and then they were absorbed by Oracle. Have you tried using "hints" or perhaps moving the "assign_ipa.username = 'SJones'" to the JOIN? — ron tornambe

Gerd Riesselmann Gerd Riesselmann · Accepted Answer · 2012-03-01T19:42:03

Not quite sure if I'm right, but I think the following is happening here. This:

WHERE assign_ipa.username = 'SJones'

may create a temporary table, since it requires a full table scan. Temporary tables have no indexes, and they tend to slow down things down a lot.

The second case

INNER JOIN ipa ON ipa.ipa_code = master_final.IPA
INNER JOIN assign_ipa ON ipa.ipa_id = assign_ipa.ipa_id
WHERE assign_ipa.ipa_id in ('688','689')

on the other hand allows for joining of indexes, which is fast. Additionally, it can be transformed to

SELECT .... FROM master_final WHERE IDA IN (688, 689) ...

and I think MySQL is doing that, too.

Creating an index on assign_ipa.username may help.

Edit

I rethought the problem and now have a different explanation.

The reason of course is the missing index. This means that MySQL has no clue how large the result of querying assign_ipa would be (MySQL does not store counts), so it starts with the joins first, where it can relay on keys.

That's what row 2 and 3 of explain log tell us.

And after that, it tries to filter the result by assign_ipa.username, which has no key, as stated in row 1.

As soon as there is an index, it filters assign_ipa first, and joins afterwards, using the according indexes.

Why can't MySQL optimize this query?

2 Answers