I have a flattened dataset: each row contains user attributes (age, location, etc.) plus the register and visit datetimes. It is partitioned per day, with ~10M visit rows per day, 25M users overall, and 5M users each day. This works now with a few months of data; for 1 year it will be ~3 billion+ rows.
For efficiency and to reduce size, I was thinking of moving to nested rows: each user would be one row, with nested records holding only the register and visit datetimes.
Before I make this big change, and assuming I won't exceed the 64K limit per row and that I'll change my queries accordingly: will this perform better than flattened rows?
Issues:
- If I use nested rows I lose the daily partitions by visit date, since I nest the visits into one record. (Could I partition by month instead?)
- When loading, I'll need to convert the CSV to JSON and know which partition to load each row into, so I guess I'll drop partitioning.
- Should query performance be better on fewer partitions with nested rows?
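
To make the loading step concrete, here is a rough Python sketch of the CSV-to-JSON conversion I have in mind. The column names (`user_id`, `age`, `loc`, `register_ts`, `visit_ts`), the grouping by user and visit month, and keeping the register datetime at the user level are all assumptions for illustration, not my real schema.

```python
import csv
import io
import json
from collections import defaultdict


def nest_visits(csv_text):
    """Group flat visit rows into one nested record per (user, visit month).

    Column names are hypothetical; timestamps are assumed to be ISO-like
    strings ("YYYY-MM-DD HH:MM:SS") so the month is the first 7 characters.
    """
    groups = defaultdict(lambda: {"attrs": None, "visits": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row["visit_ts"][:7]  # e.g. "2015-03", used as the partition key
        group = groups[(row["user_id"], month)]
        # User attributes repeat on every flat row; keep a single copy.
        group["attrs"] = {
            "age": int(row["age"]),
            "loc": row["loc"],
            "register_ts": row["register_ts"],
        }
        group["visits"].append({"visit_ts": row["visit_ts"]})

    records = []
    for (user_id, month), group in sorted(groups.items()):
        record = {"user_id": user_id, "month": month}
        record.update(group["attrs"])
        record["visits"] = sorted(group["visits"], key=lambda v: v["visit_ts"])
        records.append(record)
    return records


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one nested row per line."""
    return "\n".join(json.dumps(r) for r in records)


# Tiny made-up sample, only to show the shape of the input.
SAMPLE = """user_id,age,loc,register_ts,visit_ts
u1,30,NY,2015-01-05 10:00:00,2015-03-01 08:00:00
u1,30,NY,2015-01-05 10:00:00,2015-03-02 09:30:00
u2,41,LA,2015-02-10 12:00:00,2015-03-01 11:15:00
u1,30,NY,2015-01-05 10:00:00,2015-04-01 07:45:00
"""

if __name__ == "__main__":
    print(to_ndjson(nest_visits(SAMPLE)))
```

Each output line carries its own `month` value, so in principle a loader could route it to a monthly partition. For the real volume this grouping would need to run as a streaming or distributed job rather than in memory.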
Thx a lot