
I have a flattened dataset; each row contains user attributes (age, location, etc.) plus register and visit datetimes, partitioned per day. That is ~10M visit rows per day, 25M users total, ~5M active users each day. This works now with a few months of data, but at one year it will be 3+ billion rows.

For efficiency and to reduce size, I was thinking of moving to nested rows: each user would have nested records containing only the register and visit datetimes.

Before I make the big change, and assuming I won't hit the 64K per-row limit and that I'll change my queries accordingly: will this perform better than flattened rows?
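For example — a rough sketch only, with made-up names (mydataset.users_nested, a repeated visit_times field) — a per-user visit count in legacy BigQuery SQL could become a scoped aggregation inside each row instead of a GROUP BY over millions of flat rows:

    SELECT
      user_id,
      age,
      -- Scoped aggregation: count this user's repeated visit
      -- timestamps within their own row.
      COUNT(visit_times) WITHIN RECORD AS visit_count
    FROM mydataset.users_nested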

Issues:

  • If I use nesting I lose the daily partitions by visit date, since I nest the visits into one record per user. (Could I partition by month instead?)

  • When loading, I'll need to convert the CSV to JSON and know which partition each row belongs to, so I guess I'll drop partitioning. (A sample nested record is shown after this list.)

  • Should query performance over fewer, but nested, partitions be better?
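For reference, here is what one nested record might look like in the JSON load file mentioned in the second issue. All field names are made up for illustration, and in a real BigQuery load file the format is newline-delimited JSON, so each record must sit on a single line:

    {"user_id": "u123", "age": 34, "loc": "NYC",
     "register_time": "2014-01-03 09:12:00",
     "visit_times": ["2014-02-01 10:00:00", "2014-02-07 18:30:00"]}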

Thanks a lot


1 Answer


First, the actual import row size limit for JSON data is 20 MB, not 64K. I've filed a doc bug to get this updated in our public documentation.

  1. Nesting your data may improve performance somewhat, but without knowing the actual queries it is difficult to say, and it is unlikely to have a large effect. It will, however, likely make your queries less expensive, since each user's attributes would be stored once rather than repeated on every visit row.

  2. You may be able to union multiple daily partitions and then GROUP BY (or GROUP EACH BY, given the number of distinct users) to collect the visits per user; a sketch follows this list. It would probably be a good idea to try this on a smaller sample of your data before productionizing it.
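For illustration, that roll-up might look like the sketch below (legacy BigQuery SQL, with made-up table names standing in for your daily partitions). NEST() aggregates each user's visit timestamps into a repeated field; note that the repeated output only survives if the query writes to a destination table with result flattening disabled:

    SELECT
      user_id, age, loc, register_time,
      -- Collect this user's visit timestamps into a repeated field.
      NEST(visit_time) AS visit_times
    -- In legacy SQL, a comma between tables means UNION ALL.
    FROM visits.visits_20140101, visits.visits_20140102
    -- GROUP EACH BY copes with a very large number of distinct keys.
    GROUP EACH BY user_id, age, loc, register_time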

My guess is that the more complex queries and loading logic mean you won't see much benefit from nesting your data. You will likely be better off keeping your data denormalized and flattened, partitioning it by day, and querying over the union of whichever daily tables a query needs. We're working on ways to make it easier to specify ranges of tables in a query, which may be helpful.
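For example, a sketch of that flat approach (illustrative table names; the comma is again legacy SQL's UNION ALL), counting a few days of visits per user:

    SELECT user_id, COUNT(*) AS visit_count
    FROM visits.visits_20140106,
         visits.visits_20140107,
         visits.visits_20140108
    GROUP EACH BY user_id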