
I'm backing up my Kafka topics to S3 using Confluent's kafka-connect-s3 https://www.confluent.io/hub/confluentinc/kafka-connect-s3. I want to be able to query this data easily with Athena and have it partitioned properly for cheap, fast reads.

I want to partition by a (year, month, day, topic) tuple. The year/month/day part is already solved by using the Daily partitioner https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects. year=YYYY/month=MM/day=DD is now part of the path, so any Hive-based query is partitioned on time. The MSCK REPAIR TABLE documentation relies on the same key=value convention; note its example using userid=

https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html

However, based on these docs https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#s3-object-names, I get {topic} in the path, but there's no way to change it to topic={topic}. I could work this into the prefix (instead of env={env}, the prefix would be env={env}/topic={topic}), but that seems redundant, since it leaves an only-child directory {topic} underneath it.

I also noticed that the topic name is in the S3 object name, delimited by + (along with the partition and starting offset).
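To make the layout described above concrete, here is a minimal Python sketch of how such a key is put together and taken apart again. It assumes the layout from the linked "S3 object names" docs (prefix, then {topic}, then the partitioner's directories, then a file named topic+partition+startOffset), and it assumes the start offset is zero-padded to 10 digits. The function names, the `env=prod` prefix, and the `orders` topic are hypothetical and only for illustration; this is not the connector's own code.

```python
from datetime import datetime, timezone


def build_object_key(prefix, topic, kafka_partition, start_offset, ts, ext="json"):
    """Approximate the key layout:
    <prefix>/<topic>/<year=/month=/day=>/<topic>+<partition>+<startOffset>.<ext>
    (assumed layout; offset padding to 10 digits is also an assumption).
    """
    encoded = f"year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    filename = f"{topic}+{kafka_partition}+{start_offset:010d}.{ext}"
    return f"{prefix}/{topic}/{encoded}/{filename}"


def parse_object_key(key):
    """Split a key into the pieces a reader could recover from it."""
    *dirs, filename = key.split("/")
    # Kafka topic names cannot contain '+', so splitting from the right is safe.
    topic, partition, rest = filename.rsplit("+", 2)
    offset, _, ext = rest.partition(".")
    # Only key=value directories follow the Hive partition naming convention.
    hive_partitions = dict(d.split("=", 1) for d in dirs if "=" in d)
    return {
        "topic": topic,
        "kafka_partition": int(partition),
        "start_offset": int(offset),
        "ext": ext,
        "hive_partitions": hive_partitions,
    }


if __name__ == "__main__":
    key = build_object_key(
        "env=prod", "orders", 3, 42, datetime(2021, 5, 7, tzinfo=timezone.utc)
    )
    print(key)
    print(parse_object_key(key))
```

In this sketch the topic can be recovered from the file name, but it does not show up among the key=value directories: only env, year, month, and day do. The bare {topic} directory carries no partition key name.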

My question: how can I get topic={topic} in my path so hive-based queries automatically create that partition? Or do I already get that for free by having the topic in the path (without topic=) or in the S3 object name (again, without topic=)?

Note: The S3 Sink shouldn't be considered a "backup", since some metadata is lost when the records get written - OneCricketeer
@OneCricketeer thanks for that. What alternative would you recommend as a backup? We don't care about most of the metadata (except for metadata like consumer offsets, which live in their own topic that we back up separately). - Daniel Epstein
There's this S3 connector, and a few forks of it that store the raw binary data. However, I say it's not a backup because there isn't really a proper "restoration" tool; it's just raw binary data written to S3, so if you really need a small RTO, you'd need to prepare for that scenario with your own tooling. The same can be said for the offsets topic, since the keys are very important there (and for the _schemas topic, if you use the Schema Registry) - OneCricketeer

1 Answer


how can I get topic={topic} in my path so hive-based queries automatically create that partition?

There isn't a way to do that.

The recommendation would be to create one partitioned table per topic, rather than making the topic itself a partition.
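A rough sketch of what the table-per-topic approach could look like, written as a small Python helper that generates the Athena DDL for one topic. The bucket name, the `env=prod` prefix, the column list, and the choice of JSON with the OpenX SerDe are all assumptions for illustration; adjust them to whatever format the connector is actually writing.

```python
def athena_ddl_for_topic(topic, bucket, prefix, columns):
    """Return (CREATE TABLE, MSCK REPAIR) statements for a single topic.

    The table location is the topic's own directory, so the
    year=/month=/day= directories beneath it become the partitions.
    All arguments are hypothetical example values supplied by the caller.
    """
    # Topic names may contain '-' or '.', which are awkward in table names.
    table = topic.replace("-", "_").replace(".", "_")
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns.items())
    location = f"s3://{bucket}/{prefix}/{topic}/"
    create = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY (year string, month string, day string)\n"
        f"ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"  # assumes JSON output
        f"LOCATION '{location}';"
    )
    repair = f"MSCK REPAIR TABLE `{table}`;"
    return create, repair


if __name__ == "__main__":
    create, repair = athena_ddl_for_topic(
        "orders", "my-bucket", "env=prod", {"id": "string", "amount": "double"}
    )
    print(create)
    print(repair)
```

Because each table points at its own {topic} directory, the topic never needs to appear as topic= in the path: choosing the table is what selects the topic, and the date directories handle partition pruning within it.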