2
votes

I have time series data streaming in point by point, say every 5 seconds. And the points might arrive out of order. I want to aggregate in realtime up to higher timespans, say 5m, 30m, 60m. My primary concern is fast reads.

I'm interested in what techniques are common for performing this realtime aggregation. I'm thinking I'm going to need a long term store on disk, but for near realtime points I think I should be storing them in memory, to make it easier to aggregate.

Is the preferred way to store them in a memory cache (Redis) and then have a periodically triggered job that calculates the aggregate and flushes it to disk? If so, what if I get a point that arrives after the periodic job has run? Do I go back and throw away that point and calculate the period again?

I'm probably answering my own questions here, but I'm fishing for any alternatives out there.

Thanks in advance. Chris :-)


2 Answers

1
votes

A lot of tools expect your timestamps to arrive in order, since their underlying data structures assume that.

There's always a trade-off. As I see it you have 2 options:

  1. Use a commonly used TSDB. Most of them assume your data is ordered, so you will probably need to order your data first. For this you will need to decide on the maximum delay within which an out-of-order sample may still arrive.

  2. If you can't lose any data, you should look for tools that can continuously update existing data.

If you use StatsD to create the streaming data, you can configure it to use any flush interval you wish.

If you are looking for a time series data structure for Redis, I started to work on a module (it's not tested in production yet, and the APIs might change): https://github.com/danni-m/redis-tsdb

1
votes

There are many options; which one to use will depend on how accurate you need the aggregate numbers to be.

If you don't need perfect counts, you can store them in a HyperLogLog, using the timestamp and other attributes as the key. This way it will not matter if data comes in out of order.
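As a rough illustration of why arrival order stops mattering, here is a small pure-Python HyperLogLog, with one sketch per time bucket. Note that HyperLogLog estimates the number of distinct items, so this fits when the aggregate you want is something like unique users per 5 minutes. The `record` helper, the `sketches` dict and the `user-…` ids are made up for the example; in Redis you would likely use `PFADD` and `PFCOUNT` rather than rolling your own.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch for illustration, not a tuned implementation."""

    def __init__(self, p=10):
        self.p = p                  # number of hash bits used to pick a register
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        # max() is commutative and idempotent, so order and duplicates don't matter.
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return round(est)


sketches = {}  # (resolution, bucket_start) -> HyperLogLog


def record(ts, item, res=300):
    """Add item to the sketch of the bucket that contains ts (hypothetical helper)."""
    key = (res, ts - ts % res)
    sketches.setdefault(key, HyperLogLog()).add(item)


record(1010, "user-2")
record(1000, "user-1")   # arrives late; the sketch ends up the same either way
record(1005, "user-1")   # a repeated item does not change the registers
print(sketches[(300, 900)].count())
```

The price is that you only get an approximate count, and you cannot get sums or averages out of it.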

There are also a number of open source and commercial time series databases, like InfluxDB, Druid, etc. (Search Google for "Time Series Database".)