I'm trying to build a system that ingests, stores, and queries app event data. In the future it will also be used for other tasks (ML, analytics, etc.), which is why I think Databricks could be a good option (for now).
The main use case will be retrieving user-action events occurring in the app. Batches of this event data will land in an S3 bucket roughly every 5-30 minutes, and Databricks Auto Loader will pick them up and store them in a Delta table.
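Roughly, the ingestion I have in mind looks like this (the bucket paths, schema/checkpoint locations, and table name are placeholders, and it assumes the daily partitioning I describe below):

```python
# Rough sketch of the planned Auto Loader ingestion. Paths and the table
# name are placeholders; I'm assuming the batches land as JSON files and
# carry an event_time column. `spark` is the session Databricks provides.
from pyspark.sql.functions import col, date_trunc

events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/app_events")
    .load("s3://my-bucket/app-events/")
)

(
    events
    .withColumn("event_date", date_trunc("day", col("event_time")))
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/app_events")
    .partitionBy("event_date")
    .trigger(availableNow=True)  # process whatever has landed, then stop
    .toTable("app_events")
)
```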
A typical query will be: get all events where colA = x over the last day, week, or month.
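In DataFrame terms I imagine that looking something like this (app_events, event_date, and colA are placeholder names):

```python
# Sketch of the typical lookup, assuming the placeholder table/column
# names above: filter to a date window, then to a single colA value.
from pyspark.sql.functions import col, current_date, date_sub

recent = (
    spark.table("app_events")
    .where(col("event_date") >= date_sub(current_date(), 7))  # last week
    .where(col("colA") == "x")
)
recent.show()
```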
I think the typical strategy here is to partition by date, e.g.:
date_trunc("day", date) # 2020-04-11T00:00:00.000+0000
This would create 365 partitions per year, and I expect each partition to hold about 1 GB of data. In addition to partitioning, I plan on Z-ordering by one of the high-cardinality columns that will frequently appear in the WHERE clause (via a periodic OPTIMIZE, sketched below).
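For the Z-ordering step, I'm picturing a scheduled job along these lines (app_events, event_date, and colA are again placeholders, and the literal date is just for illustration):

```python
# Periodic compaction + Z-ordering sketch for the hypothetical app_events
# table. OPTIMIZE ... ZORDER BY is the Delta Lake command on Databricks;
# the WHERE clause limits the rewrite to recent partitions so older,
# already-optimized partitions aren't rewritten on every run.
spark.sql("""
    OPTIMIZE app_events
    WHERE event_date >= '2020-04-04'
    ZORDER BY (colA)
""")
```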
Is this too many partitions? Is there a better way to partition this data? Since I'm partitioning by day and data arrives every 5-30 minutes, is it possible to just "append" data to a day's partition instead?