3
votes

I'm trying to build a system that ingests, stores, and queries app event data. In the future it will also be used for other tasks (ML, analytics, etc.), which is why I think Databricks could be a good option (for now).

The main use case will be retrieving user-action events that occur in the app. Batches of this event data will land in an S3 bucket roughly every 5-30 minutes, and Databricks Auto Loader will pick them up and store them in a Delta table. For concreteness, the ingest side would look roughly like the sketch below (bucket paths, file format, and schema location are placeholders, not final).
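```python
# Sketch of the Auto Loader read side; paths and file format are placeholders.
# `spark` is the SparkSession provided by the Databricks runtime.
raw_events = (
    spark.readStream
        .format("cloudFiles")  # Databricks Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
        .load("s3://my-bucket/events/")
)
```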

A typical query will be: get all events where colA = x over the last day, week, or month.
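In SQL terms, something like this (a sketch; `events` and `event_date` are placeholder names for the Delta table and the derived date column described below):

```python
# Typical lookup: all events matching colA over a recent window.
recent = spark.sql("""
    SELECT *
    FROM events
    WHERE colA = 'x'
      AND event_date >= date_sub(current_date(), 7)  -- last week; 1 or ~30 for day/month
""")
```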

I think the typical strategy here is to partition by date, e.g.:

date_trunc("day", date)  # 2020-04-11T00:00:00.000+0000

This will create 365 partitions per year, and I expect each partition to hold about 1 GB of data. In addition to partitioning, I plan on Z-ordering on one of the high-cardinality columns that will frequently appear in the WHERE clause. The write side would then look roughly like the sketch below.
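Again a sketch only: `event_ts`, `event_date`, and the `events` table name are placeholders, and the `availableNow` trigger assumes a recent Databricks runtime.

```python
from pyspark.sql import functions as F

# Derive a day-truncated partition column from the event timestamp and
# write the Auto Loader stream into a Delta table partitioned by it.
# The high-cardinality column would be Z-ordered later via OPTIMIZE.
(raw_events
    .withColumn("event_date", F.date_trunc("day", F.col("event_ts")))
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .partitionBy("event_date")
    .trigger(availableNow=True)  # process everything that has landed, then stop
    .toTable("events"))
```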

Is this too many partitions? Is there a better way to partition this data? Since I'm partitioning by day and data is coming in every 5-30 minutes, is it possible to just append data to a day's partition instead?


1 Answer

2
votes

It really depends on the amount of data coming in per day and how many files have to be read to answer your query. If it's tens of GBs per day, then partitioning per day is OK. But you can also partition by the timestamp truncated to the week, in which case you'll get only 52 partitions per year.

Z-Ordering will help keep the files optimized, but if you're appending data every 5-30 minutes you'll end up with at least 48 files per day inside a partition (one per 30-minute batch, far more with 5-minute batches), so you will need to run OPTIMIZE with ZORDER BY every night, or on a similar schedule, to reduce the number of files.

Also, make sure that you're using optimized writes: although this makes the write operation slower, it decreases the number of files generated. (If you're planning to use Z-Ordering, then it makes no sense to enable auto-compaction.)
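For example, the nightly compaction and the write setting could look like this (table and column names are just placeholders; use your actual table and your high-cardinality filter column):

```python
# Enable optimized writes on the target table via a Delta table property.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")

# Nightly job: compact the small files and cluster them by the
# high-cardinality column used in the WHERE clause.
spark.sql("OPTIMIZE events ZORDER BY (colA)")
```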