0
votes

Our requirement is very simple: we want to store GPS locations for sensors, and the data should never be older than a couple of days. The granularity of the data would be at most around one minute.

Since the total number of sensors could exceed a billion, SimpleDB is not an option unless I write the partitioning logic myself. SimpleDB does, however, index every attribute, which makes it possible to run periodic cleanup scripts that delete entries older than 2 days.

DynamoDB looks far better since it has no limit on the amount of data, and I can use a partition+range primary key on sensorID+timestamp. However, deleting old data would require a scan, unless I also have a global secondary index on the timestamp field. Using this global secondary index, the query could potentially be quicker.
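For illustration, this is roughly the table layout I have in mind (a sketch with boto3; the table and attribute names are placeholders I made up):

# Sketch only: boto3 assumed; "SensorLocations", "sensorId" and "ts" are placeholder names.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="SensorLocations",
    AttributeDefinitions=[
        {"AttributeName": "sensorId", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "N"},  # epoch seconds
    ],
    KeySchema=[
        {"AttributeName": "sensorId", "KeyType": "HASH"},  # partition key
        {"AttributeName": "ts", "KeyType": "RANGE"},       # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)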

Is it just me, or is there a better way out there? Using DynamoDB/SimpleDB is preferable, since the entire deployment is in the AWS environment and we don't want to invest much in ops. I know other NoSQL DBs like MongoDB support this.

2
I really don't see how GPS locations, which all have the same format, would be better stored in a NoSQL DB. That is really the very definition of data that is best stored, indexed, retrieved and analyzed in tables. – Marcus Müller
Because I'll have billions of entries. I'd like the system to be automatically sharded/partitioned so that I don't have to worry about it. It almost sounds like anybody using NoSQL without a flexible schema is making a mistake; that is not the case. – Ouroboros
"Billions of identically formed entries" is exactly why you should use relational databases, not schemaless NOSQL. When you've got billions of identical data points, but you're storing them as key-value pairs, yes, you're doing a mistake by not understanding what a database does, and why partitioning key-value stores is much harder than partitioning sortable/indexable tables.Marcus Müller
Leaving aside the fact that a relational database would need maybe 8 B for the ID, 8 B for the timestamp, 4 B for longitude and 4 B for latitude = 24 bytes for a whole entry, whereas your key/value store would need that, plus the storage needed to store the same key (or a reference to the same key) a billion times, plus the structural overhead of some tree-like structure to hold the key/value pairs belonging to a single entity. Your problem is incredibly RAM- and complexity-intensive if you do it in key/value databases, and pretty easy and efficient if done in relational databases. – Marcus Müller
So you probably wouldn't even need partitioning to do this in a relational DB on a "normal"-sized platform. The whole "old entries must die" thing becomes very easy, since the table typically grows in one direction and there's no indirection; automatic database maintenance routines would clean up and compact the database during usage, and none of this is anything you'd have to care about yourself. Now, explain again: why does key/value have any advantage, aside from people mentioning it more often on Twitter? – Marcus Müller

2 Answers

2
votes

A new feature has been added to DynamoDB. Please check TTL (Time to Live).

This will automatically delete an item once that item's TTL has expired.
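A rough sketch of what that looks like with boto3 (the table and attribute names here are placeholders, not anything prescribed by DynamoDB):

# Sketch: enable TTL on an existing table and write an item that expires in ~2 days.
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Tell DynamoDB which numeric attribute holds the expiry time (epoch seconds).
dynamodb.update_time_to_live(
    TableName="SensorLocations",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiresAt"},
)

# Write a location that DynamoDB will expire roughly 2 days from now.
now = int(time.time())
dynamodb.put_item(
    TableName="SensorLocations",
    Item={
        "sensorId": {"S": "sensor-123"},
        "ts": {"N": str(now)},
        "lat": {"N": "48.1371"},
        "lon": {"N": "11.5754"},
        "expiresAt": {"N": str(now + 2 * 24 * 3600)},
    },
)

Note that TTL deletion happens in the background, so expired items can linger for a while and will still show up in scans and queries until they are actually removed, unless you filter them out.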

1
vote

You can save entries in date-based tables, in increments of x day(s):

GPS_LOCATIONS_09052016
GPS_LOCATIONS_09072016
...

Then you can drop old tables every x day(s), as sketched below.
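A rough sketch of the rotation with boto3 (the key schema and the 2-day retention below are assumptions for illustration):

# Sketch of rotating date-based tables; names use the MMDDYYYY format from above.
from datetime import datetime, timedelta, timezone
import boto3

dynamodb = boto3.client("dynamodb")
today = datetime.now(timezone.utc)
existing = dynamodb.list_tables()["TableNames"]

# Create today's table if it doesn't exist yet.
new_table = "GPS_LOCATIONS_" + today.strftime("%m%d%Y")
if new_table not in existing:
    dynamodb.create_table(
        TableName=new_table,
        AttributeDefinitions=[
            {"AttributeName": "sensorId", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "sensorId", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    )

# Drop the table that has fallen outside the retention window (x = 2 days here).
old_table = "GPS_LOCATIONS_" + (today - timedelta(days=2)).strftime("%m%d%Y")
if old_table in existing:
    dynamodb.delete_table(TableName=old_table)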

How many GPS locations are there per sensor? If you have, for example, 500 million unique sensors, then partitioning on sensor ID isn't very efficient.

If date-based tables don't work out for you, then you can create a GSI with a timestampHash hash key and a timestamp range key, where timestampHash is a number between 1 and y, with y depending on your data size. Then you can do a range query against this GSI for every timestampHash value, where timestamp is less than now (or whatever you set your purge threshold to). The timestampHash helps spread your data across partitions, which helps with throughput.
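For illustration, a purge pass against such a GSI might look roughly like this (boto3; the index name, attribute names, and y = 10 are assumptions, and pagination plus batched deletes are omitted for brevity):

# Sketch of purging old items via the timestampHash GSI described above.
import time
import boto3

dynamodb = boto3.client("dynamodb")

cutoff = int(time.time()) - 2 * 24 * 3600  # purge anything older than 2 days
NUM_BUCKETS = 10                           # the "y" from above

for bucket in range(1, NUM_BUCKETS + 1):
    resp = dynamodb.query(
        TableName="SensorLocations",
        IndexName="timestampHash-timestamp-index",
        # "timestamp" is a DynamoDB reserved word, hence the #ts alias.
        KeyConditionExpression="timestampHash = :h AND #ts < :cutoff",
        ExpressionAttributeNames={"#ts": "timestamp"},
        ExpressionAttributeValues={
            ":h": {"N": str(bucket)},
            ":cutoff": {"N": str(cutoff)},
        },
    )
    # Delete each expired item from the base table by its primary key
    # (the GSI projects the base table's key attributes).
    for item in resp.get("Items", []):
        dynamodb.delete_item(
            TableName="SensorLocations",
            Key={"sensorId": item["sensorId"], "timestamp": item["timestamp"]},
        )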