0 votes

I have an InnoDB table in MySQL where I have to select and sum a lot of data over date ranges, and I can't seem to get it to run fast enough for the use case.

The table is as follows:
user_id: int
request_date: date
amount: int

The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10

I have a composite index on all three columns, in the order: user_id, request_date, amount.
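For reference, the index DDL looks roughly like this (the index name matches the EXPLAIN output below):

ALTER TABLE `table` ADD INDEX composite (user_id, request_date, amount);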

The query I use for selecting is:

SELECT   
    SUM(amount)  
FROM `table`
WHERE user_id = ?  
AND request_date <= ?  
AND request_date >= ?

I hardcode the dates into the query.

Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.

It's running on DI with 8 GB RAM and 4 CPUs (not dedicated).

Update
The output of EXPLAIN is:

select_type: SIMPLE  
type: range
possible_keys: composite  
key: composite  
key_len: 7 
ref: null 
rows: 14994440  
Extra: Using where; Using index
Can you please post the output from EXPLAIN? (prepend the word EXPLAIN to the query above) – Dave Stokes
Added an update with the output of EXPLAIN. – Stromgren
Partition the table. Do the date ranges you search for span years, decades, what? You could go as far as to partition by user and year if needed. That way the hundreds of millions of rows become far fewer. – xQbert
The range is 30 days, but 30 days running from the current date, so I can't partition for specific months. There can still be up to 10 million rows per user in a 30-day period, times a few hundred users. – Stromgren
@xQbert - He still has to touch all the rows for that user in that date range. So no benefit to partitioning. – Rick James

2 Answers

2 votes

I've used various techniques in the past to do similar stuff.

  • You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month value.

  • I've had some performance increase by splitting the date and time portions into separate columns. The advantage is that you can then quickly grab all data for a given date by looking at the date field, without even considering the time portion.

  • If you know what kind of data you'll be requesting, and you can allow for some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by keeping a separate table with aggregated data. If you only need to calculate "today" live, things will be much faster. Or, if you can accept numbers that are a bit old, you can just pre-calculate periodically.

The table that I'm talking about could be something like:

CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM `table`
GROUP BY user_id, request_date;

After that, rewrite your query like this, and it'll be extremely fast:

SELECT SUM(amount)  
FROM aggregated_requests  
WHERE user_id = ?  
AND request_date <= ?  
AND request_date >= ?
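
Since rows for past dates never change, a daily job can keep that table current. A minimal sketch, assuming the aggregated_requests table above (the index name is illustrative):

-- Covering index so the lookup never touches the base rows:
ALTER TABLE aggregated_requests
    ADD INDEX agg_composite (user_id, request_date, amount);

-- Run once per day to fold in yesterday's raw rows:
INSERT INTO aggregated_requests (user_id, request_date, amount)
SELECT user_id, request_date, SUM(amount)
FROM `table`
WHERE request_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY user_id, request_date;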
1 vote

Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to Plan B:

Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
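
A minimal sketch of such a summary table (the names are illustrative; the link covers the pattern in detail):

CREATE TABLE daily_totals (
    user_id INT UNSIGNED NOT NULL,
    request_date DATE NOT NULL,
    amount INT UNSIGNED NOT NULL,  -- subtotal of amount for that user and day
    PRIMARY KEY (user_id, request_date)
) ENGINE=InnoDB;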

Partitioning is unlikely to help more than a good index (as in Plan A).

More on B

If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.

  • IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead. (See the sketch after this list.)
  • Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on. (Also sketched below.)
  • Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine it with the "hybrid" approach to make that run faster.
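
A sketch of the IODKU approach, assuming the illustrative daily_totals table above; run it alongside each raw insert (or from a trigger):

INSERT INTO daily_totals (user_id, request_date, amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE amount = amount + VALUES(amount);

And a sketch of the hybrid query: whole days come from the summary table, and 'today' is totalled from the raw data:

SELECT ( SELECT COALESCE(SUM(amount), 0)
         FROM daily_totals
         WHERE user_id = ?
           AND request_date >= ?
           AND request_date < CURRENT_DATE )
     + ( SELECT COALESCE(SUM(amount), 0)
         FROM `table`
         WHERE user_id = ?
           AND request_date = CURRENT_DATE ) AS total;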

(My blog gives those 3, plus 3 more.)

Other

"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.