0 votes

I have an InnoDB table in MySQL where I have to select and sum a lot of data over date ranges, and I can't seem to get it to run fast enough for the use case.

The table is as follows:
user_id: int
request_date: date
amount: int

The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10

I have a composite index on all three columns, in the order: user_id, request_date, amount.
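For reference, the index DDL looks roughly like this (the index name matches the EXPLAIN output below):

ALTER TABLE `table` ADD INDEX composite (user_id, request_date, amount);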

The query I use for selecting is:

SELECT   
    SUM(amount)  
FROM `table`
WHERE user_id = ?  
AND request_date <= ?  
AND request_date >= ?

I hardcode the dates into the query.

Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.

It's running on DI with 8 GB RAM and 4 CPUs (not dedicated).

Update
The output of EXPLAIN is:

select_type: SIMPLE  
type: range
possible_keys: composite  
key: composite  
key_len: 7 
ref: null 
rows: 14994440  
Extra: Using where; Using index
Can you please post the output from EXPLAIN? (prepend the word EXPLAIN to the query above) – Dave Stokes
Added an update with the output of EXPLAIN. – Stromgren
Partition the table. Do the date ranges you search for span years, decades, what? You could go as far as to partition by user and year if needed. That way the hundreds of millions of rows become far fewer. – xQbert
The range is 30 days, but 30 days running from the current date, so I can't partition for specific months. There can still be up to 10 million rows per user in a 30-day period, times a few hundred users. – Stromgren
@xQbert - He still has to touch all the rows for that user in that date range. So no benefit to partitioning. – Rick James

2 Answers

2 votes

I've used various techniques in the past to do similar stuff.

  • You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month value.

  • I've had some performance increase by splitting the date and time portions into separate columns. The advantage is that you can then quickly grab all data for a given date by looking at the date field, without even considering the time portion.

  • If you know what kind of data you'll be requesting, and you can allow for some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by keeping a separate table with aggregated data. If you only need to calculate "today" live, things will be much faster. Or, if you can accept numbers that are a bit old, you can just pre-calculate periodically.

The table that I'm talking about could be something like:

CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM `table`
GROUP BY user_id, request_date;

After that, rewrite your query like this, and it'll be extremely fast:

SELECT SUM(amount)  
FROM aggregated_requests  
WHERE user_id = ?  
AND request_date <= ?  
AND request_date >= ?
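
Since rows for past dates never change, a daily job can keep that table current. A minimal sketch, assuming the aggregated_requests table above (the index name is illustrative):

-- Covering index so the lookup never touches the base rows:
ALTER TABLE aggregated_requests
    ADD INDEX agg_composite (user_id, request_date, amount);

-- Run once per day to fold in yesterday's raw rows:
INSERT INTO aggregated_requests (user_id, request_date, amount)
SELECT user_id, request_date, SUM(amount)
FROM `table`
WHERE request_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY user_id, request_date;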
1 vote

Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to Plan B:

Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
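
A minimal sketch of such a summary table (the names are illustrative; the link covers the pattern in detail):

CREATE TABLE daily_totals (
    user_id INT UNSIGNED NOT NULL,
    request_date DATE NOT NULL,
    amount INT UNSIGNED NOT NULL,  -- subtotal of amount for that user and day
    PRIMARY KEY (user_id, request_date)
) ENGINE=InnoDB;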

Partitioning is unlikely to help more than a good index (as in Plan A).

More on B

If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.

  • IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead. (See the sketch after this list.)
  • Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on. (Also sketched below.)
  • Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine it with the "hybrid" approach to make that run faster.
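
A sketch of the IODKU approach, assuming the illustrative daily_totals table above; run it alongside each raw insert (or from a trigger):

INSERT INTO daily_totals (user_id, request_date, amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE amount = amount + VALUES(amount);

And a sketch of the hybrid query: whole days come from the summary table, and 'today' is totalled from the raw data:

SELECT ( SELECT COALESCE(SUM(amount), 0)
         FROM daily_totals
         WHERE user_id = ?
           AND request_date >= ?
           AND request_date < CURRENT_DATE )
     + ( SELECT COALESCE(SUM(amount), 0)
         FROM `table`
         WHERE user_id = ?
           AND request_date = CURRENT_DATE ) AS total;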

(My blog gives those 3, plus 3 more.)

Other

"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.