0
votes

In my organisation we have multiple devices sending data every seconds. The data is processed and partitioned in AWS S3 like this /year=YYYY/month=MM/day=DD/file.csv.

Using AWS Athena we use to run queries like this: SELECT col1, col2, coln FROM data WHERE year = 'YYYY' AND month = 'MM' and DAY = 'dd' AND device_id = 123 to retrieve data from one device for some time in a day. Sometimes we also need to get data from multiple devices (device_id IN (...)) and at different times. Note that the columns device_id and ts exist in the dataset and only ts is used to generate partitions.

Here's my question:

Will this method of partitioning be efficient in a long term ? At this time, we only have about 150 active devices, but we plan to scale at 1000 and more. Considering the fact that the query schema would be the same (get data for some device at a certain time), is it better to partition by device_id and then by date (/devive_id/year=YYYY/month=MM/day=DD/file.csv) ?

1

1 Answers

2
votes

The partitioning is very good for your supplied query -- it will only need to look in one subdirectory for that single day of data.

However, if you were querying for a specific device across all time (without specifying a month/day), then it would not be efficient.

You will need to decide what is going to be more common:

  • If a specific device will always be queried, then partition by Device, then Date
  • If a specific day/month will always be queried, then your current method is fine (possibly with an additional partition of device after Day)