2
votes

I have a query:

select SQL_NO_CACHE id from users
 where id>1 and id <1000
   and id in  ( select owner_id from comments and content_type='Some_string');

(note that it is short of an actual large query used for my sphinx indexing, representing the problem) This query is taking about 3.5 seconds(modifying range from id = 1..5000 makes it about 15 secs).

users table has about 35000 entries and comments table has about 8000 entries.

Explain on above query:

explain select SQL_NO_CACHE id from users
        where id>1 and id <1000
          and id in  ( select distinct owner_id from d360_core_comments);

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |


| 1 | PRIMARY | users | range | PRIMARY | PRIMARY | 4 | NULL | 1992 | Using where; Using index |

| 2 | DEPENDENT SUBQUERY | d360_core_comments | ALL | NULL | NULL | NULL | NULL | 6901 | Using where; Using temporary |

where the individual subquery(select owner_id from d360_core_comments where content_type='Community20::Topic';) here is taking almost 0.0 seconds.

However if I add index on owner_id,content_type, (note the order here)

create index tmp_user on d360_core_comments (owner_id,content_type);

My subquery runs as is in ~0.0 seconds with NO index used:

mysql> explain select owner_id from d360_core_comments where content_type='Community20::Topic';

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |


| 1 | SIMPLE | d360_core_comments | ALL | NULL | NULL | NULL | NULL | 6901 | Using where |

However now my main query (select SQL_NO_CACHE id from users where id>1 and id <1000 and id in ( select owner_id from d360_core_comments where content_type='Community20::Topic');) now runs in ~0 seconds with following explain:

mysql> explain select SQL_NO_CACHE id from users where id>1 and id <1000 and id in ( select owner_id from d360_core_comments where content_type='Community20::Topic');

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |


| 1 | PRIMARY | users | range | PRIMARY | PRIMARY | 4 | NULL | 1992 | Using where; Using index |

| 2 | DEPENDENT SUBQUERY | d360_core_comments | index_subquery | tmp_user | tmp_user | 5 | func | 34 | Using where |

So the main questions I have are:

  • If the index defined on the table used in my subquery is not getting used in my actual subquery then how it is optimizing the query here?
  • And why in the first place the first query was taking so much time when the actual subquery and main query independently are much faster?
4
Code should be indented with four spaces.tadman
@tadman I like to use <pre></pre> for output.Kermit
Stack Overflow uses four spaces to identify blocks of code. Quoting things like you've done here makes it much harder to read.tadman

4 Answers

2
votes

If you read this section of the MySQL Reference Manual: Optimizing Subqueries with EXISTS Strategy, you'll see that the query optimizer transforms your subquery condition from:

id in ( select distinct owner_id
          from d360_core_comments
         where content_type='Community20::Topic')

into:

exists ( select 1
           from d360_core_comments
          where content_type='Community20::Topic'
            and owner_id = users.id )

This is why a index on (owner_id, content_type) is not useful when the subquery is tested as standalone query, but it is useful when considering the transformed subquery.

3
votes

What seems to happen in full query without the index is that MySQL will build (some sort of) temporary table of all the owner_id that the subquery generates. Then for each row from the users table that matches the id constraint, a lookup in this temporary construct will be performed. It is unclear if the overhead is creating the temporary construct, or if the lookup is implemented suboptimally (so that all elements are linearly matched for each row from the outer query.

When you create the index on owner_id, this doesn't change anything when you run only the subquery, because it has no condition on owner_id, nor does the index cover the content_type column.

However, when you run the full query with the index, there is more information available, since we now have values coming from the outer query that should be matched to owner_id, which is covered by the index. So the execution now seems to be to run the first part of the outer query, and for each matching row do an index lookup by owner_id. In other words, a possible execution plan is:

From Index-Users-Id Get all id matching id>1 and id <1000
For Each Row
    Include Row If Index-Comment-OwnerId Contains row.Id
                   And Row Matches content_type='Some_string'

So in this case, the work to run 1000 (I assume) index lookups is faster than building a temporary construct of the 8000 possible owner_id. But this is only a hypothesis, since I don't know MySQL very well.

1
votes

The first thing you should know is that MySQL can not optimize dependent subqueries, it is a for a long time well-known MySQL deficiency, that is going to be fixed in MySQL 6.x (just google for "mysql dependent subquery" and you will see). That is the subquery is basically executed for each matching row in users table. Since you have an additional condition, the overall execution time depends on that condition. The solution is to substitute the subquery with a join (the very optimization that you expect from MySQL under the hood).

Second, there is a syntax error in your subquery, and I think there was a condition on owner_id. Thus, when you add an index on owner_id it is used, but is not enough for the second condition (hence no using index), but why is not mentioned in EXPLAIN at all is a question (I think because of the condition on the users.id)

Third, I do not know why you need that id > 1 and id < 5000 condition, but you should understand that these are two range conditions that require very accurate, sometimes non-obvious and data-dependent indexing approach (as opposed to equality comparison conditions), and if you actually do not need them and use only to undestand why the query takes so long, then it was a bad idea and they would shed no light.

In case, the conditions are required and the index on owner_id is still there, I would rewrite the query as follows:

SELECT id 
FROM (
  SELECT owner_id as id
  FROM comments
  WHERE owner_id < 5000 AND content_type = 'some_string'
) as ids
JOIN users ON (id)
WHERE id > 1;

P.S. A composite index on (content_type, owner_id) will even be better for the query.

0
votes

Step 1: Use id BETWEEN x AND y instead of id >= x AND id <= y. You may find some surprising gains because it indexes better.

Step 2: Adjust your sub-SELECT to do the filtering so it doesn't have to be done twice:

SELECT SQL_NO_CACHE id 
  FROM users
 WHERE id IN (SELECT owner_id 
                FROM comments
               WHERE content_type='Some_string' 
                 AND owner_id BETWEEN 1 AND 1000);

There seems to be several errors in your statement. You're selecting 2 through 999 for instance, presumably off by one on both ends, and the subselect wasn't valid.