What's the best way to store time series data from thousands (soon possibly millions) of real-world hardware sensors? The sensors themselves differ: some capture just one variable, some up to a dozen. I need to store these values every hour, and I don't want to delete data older than some cutoff, i.e. the data will just keep growing.
Currently, I use a MySQL database to store these time series (it also serves a web frontend that shows nice time series graphs for every sensor). I have one table per sensor, which right now comes to about 11,000 tables in total. Each table has a layout like "timestamp, value1, [value2] ...".
The database workload is mostly selects (every time somebody looks at the graphs) rather than inserts/updates (once an hour). The select query behind each graph is simply a "SELECT * FROM $sensor_id ORDER BY timestamp", so getting the info out of my select statements is pretty simple/efficient.
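For context, here is a rough sketch of what one of these per-sensor tables looks like (the table name and column types are illustrative, not my exact DDL):

    -- One table per sensor; this example sensor captures two variables
    CREATE TABLE sensor_00042 (
        timestamp DATETIME NOT NULL,
        value1    DOUBLE,
        value2    DOUBLE,
        PRIMARY KEY (timestamp)
    );

    -- The query behind each graph
    SELECT * FROM sensor_00042 ORDER BY timestamp;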
However, having that many tables already presents problems when backing up the database, because I run into LOCK limits (e.g. "mysqldump: Got error: 23: Out of resources when opening file './database/table_xyz.MYD' (Errcode: 24) when using LOCK TABLES"). I can get around that error, but obviously that got me thinking...
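(In case it's relevant: Errcode 24 is the OS's "too many open files" error, so presumably the dump is exhausting the server's open-file limit. open_files_limit is the MySQL server variable in question; raising it means editing my.cnf, plus the OS file-descriptor limit, and restarting, as far as I can tell.)

    -- Check how many files the MySQL server may hold open at once
    SHOW VARIABLES LIKE 'open_files_limit';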
So, the real question, broken down into sub-questions:
- How bad is my approach of having one table for every sensor? What if, instead of a few thousand tables, I had a few million? (I might have to deal with that many sensors in the near future.)
- Is storing all sensors' data in one combined table, with an extra column that holds the sensor_id, a better approach, even though it would probably slow down my select statement a lot ("SELECT * FROM all_sensors WHERE sensor_id='$sensor_id'")? Keep in mind that different sensors measure different things, so this table would have a few dozen columns instead of the one to a few each table has now, when every sensor has its own table. (See the sketch after this list.)
- I also thought about storing the time series data NOT in MySQL, but in flat (CSV) files instead. The graphing library I use for the frontend (dygraphs) deals fine with CSV files (plus it would give me the option of making these available for download, which would be a bonus but is not a requirement currently; there's an export sketch after this list as well). I would still need the database for other frontend-related things, but it would mean having a few dozen tables instead of 11,000 (or even more if we add more sensors).
- If I create one file for every sensor, I would probably run into filesystem limits eventually (this is an ext3 partition, so there's the ~32k files per directory limit). So the same question as above applies here too: should I instead store one large file that holds all sensors' data? That would probably slow down my reads even more, since the graphing library would need to read a much, much bigger file into memory every time someone looks at a graph.
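To make the combined-table idea concrete, this is roughly what I have in mind (names are made up; I'm assuming a composite primary key on (sensor_id, timestamp), which as far as I understand should let the per-sensor select walk the index instead of scanning the whole table, but I'd like confirmation):

    CREATE TABLE all_sensors (
        sensor_id INT      NOT NULL,
        timestamp DATETIME NOT NULL,
        value1    DOUBLE,
        value2    DOUBLE,
        -- ... a few dozen value columns, most NULL for any given sensor
        PRIMARY KEY (sensor_id, timestamp)
    );

    -- The graph query would then become:
    SELECT timestamp, value1, value2
    FROM all_sensors
    WHERE sensor_id = 42
    ORDER BY timestamp;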
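And for the CSV route, I was thinking of having MySQL write the per-sensor files itself, along these lines (the path is illustrative; SELECT ... INTO OUTFILE writes on the server host and needs the FILE privilege):

    -- Export one sensor's series as a CSV file for dygraphs / download
    SELECT timestamp, value1, value2
    FROM sensor_00042
    ORDER BY timestamp
    INTO OUTFILE '/var/www/csv/sensor_00042.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';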
What would you do?
Thanks!