12
votes

I'm trying to optimize this query:

SELECT articles.id 
FROM articles 
INNER JOIN articles_authors ON articles.id=articles_authors.fk_Articles 
WHERE articles_authors.fk_Authors=586 
ORDER BY articles.publicationDate LIMIT 0,50;

Table articles :

  • Engine : MyISAM
  • Row_format: Dynamic
  • Rows : 1 482 588
  • Data_length : 788 926 672
  • Max data length : 281 474 976 710 655
  • Index length : 127 300 608
  • data free : 0
  • checksum : null
    CREATE TABLE `articles` (
      `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `title` VARCHAR(255) NOT NULL,
    `publicationDate` DATE NOT NULL DEFAULT '1970-01-01',
    PRIMARY KEY (`id`),
    KEY `publicationDate` (`publicationDate`)
    ) ENGINE=MYISAM AUTO_INCREMENT=1498496 DEFAULT CHARSET=utf8 

Table articles_authors :

  • Engine : MyISAM
  • Row_format: Dynamic
  • Rows : 1 970 750
  • Data_length : 45 008 420
  • Max data length : 281 474 976 710 655
  • Index length : 127 300 608
  • data free : 0
  • checksum : null
    CREATE TABLE `articles_authors` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `fk_Articles` int(10) unsigned NOT NULL,
    `fk_Authors` int(10) unsigned NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `fk_Articles_fk_Authors` (`fk_Articles`,`fk_Authors`),
    KEY `fk_Articles` (`fk_Articles`),
    KEY `fk_Authors` (`fk_Authors`),
    ) ENGINE=MyISAM AUTO_INCREMENT=2349047 DEFAULT CHARSET=utf8 

Explain on query :

id (1), select_type(SIMPLE), TABLE(articles_authors), TYPE(ref), possible_keys(fk_Articles_fk_Authors, fk_Articles, fk_Authors), KEY (fk_Authors), Key_len(4), ref(const), ROWS(171568), extra (USING TEMPORARY; USING FILE sort)
id (1), select_type(SIMPLE), TABLE(articles), TYPE(eq_ref), possible_keys(PRIMARY), KEY (PRIMARY), Key_len(4), ref(articles_authors.fk_Authors), ROWS(1), extra ()

As you can see, the SQL query is not optimized (using file sort in explain).

Thanks for your help!

5
+1, for a well documented question! Love it when people actually include pertinent info!Abe Miessler
I don't see how this can be optimized more, because in the where/order clause you have values from two different tables and you can't create a composite index (fk_Authors,publicationDate)Pentium10
Edited the answer to include a de-normalization option.Johan

5 Answers

1
votes

Maybe this will help you:

SELECT articles.id 
    FROM articles 
        INNER JOIN (SELECT fk_Articles FROM articles_authors WHERE articles_authors.fk_Authors=586) sub ON articles.id=sub.fk_Articles 
ORDER BY articles.publicationDate LIMIT 0,50;
1
votes

It is using the index, like it says in the explain.

id (1), select_type(SIMPLE), TABLE(articles_authors), TYPE(ref),  
 possible_keys(fk_Articles_fk_Authors, fk_Articles, fk_Authors),`   
`KEY (fk_Authors), Key_len(4)`, ref(const), ROWS(171568),  
extra (USING TEMPORARY; USING FILE sort)

Only as an extra for the 50 rows that it selects and than orders by publication date does it do a filesort.
It creates a temporary table with 50 items. Which it then sorts with tablesort.
This has to be done this way, because MySQL cannot use the big index on those lonely 50 items, it would cost to much in IO-access time.

It's faster to to a sort on 50 numbers in memory then to access the index on disk.

You can do something to speed up the query though:

optimize table articles, articles_authors

and rerun the query.

EDIT: Speed up suggestion by denormalizing table articles

If you rewrite the query like this:

SELECT articles.id FROM articles WHERE articles.id IN (
  SELECT articles_authors.fk_articles WHERE articles_authors.fk_authors = 586 
  LIMIT 0,50
)
ORDER BY articles.publicationDate;

You will probably see the same performance, but it highlights the problem. If author 586 has 180,000 articles, then MySQL has to search 50 items out of 180k in articles_authors and then search 50 items out of 180k again in the order table.

If you merge the tables article_authors and articles, your table articles will be denormalized (assuming an article can have multiple authors) but you don't have to do the join and you save yourself the 2nd search.

CREATE TABLE `articles` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `publicationDate` date NOT NULL DEFAULT '1970-01-01',
  `title` varchar(255) NOT NULL,
  `fk_Authors` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `Articles_fk_Authors` (`id`,`fk_Authors`),
KEY `fk_Authors` (`fk_Authors`),
KEY `publicationDate` (`publicationDate`)
) ENGINE=MyISAM AUTO_INCREMENT=2349047 DEFAULT CHARSET=utf8 

Now you can select from it like so

SELECT articles.id FROM articles WHERE articles.Author = 586 
ORDER BY articles.publicationDate LIMIT 50,0
0
votes
SELECT articles.id 
FROM articles 
INNER JOIN articles_authors ON articles.id=articles_authors.fk_Articles 
WHERE articles.id=586 
ORDER BY articles.publicationDate LIMIT 0,50;
0
votes

Not sure, but Conrad's suggestion seems to change the sorting and limiting, so you might get the first 50 items of a random list in sorted order instead of the first 50 items of a sorted list.

Could a view with the join help, if it is ordered by fk_author, publicationDate and has an index? Also depends for what you are optimizing, speed or disk space?

Can you use IN in Mysql? Does it maybe optimize better? (example code, not checked)

SELECT id FROM articles WHERE id IN 
(SELECT fk_Articles FROM articles_authors WHERE fk_Authors=586) as IDs
ORDER BY publicationDate LIMIT 0,50;
0
votes

This can actually be valid, depending on your data.

SELECT articles.id 
FROM articles 
INNER JOIN articles_authors ON articles.id=articles_authors.fk_Articles 
WHERE articles_authors.fk_Authors=586 
ORDER BY articles.publicationDate LIMIT 0,50;

If articles_authors.fk_Authors=586 leads to reasonably rare rows according to the stats collected by your DB engine, it'll be cheaper to fetch the all and fetch the top 50 rows.

If, by contrast, it leads to most of the articles, it'll be cheaper to consult an index on articles.publicationDate and filter out invalid rows until you've the requested 50 rows.