I am quite new to ADL and U-SQL. I have gone through quite a lot of documentation and presentations, but I am afraid I still have many unanswered questions.
Simplifying a bit, I have a large set of data (with daily increments), but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility of doing cross-client analysis (a much less common scenario). I am aware of the importance of partitioning this data correctly (keeping one client's data together probably makes the most sense). I was looking into two scenarios:
- I will partition the data myself by splitting the files into a folder/file structure where I have full control over how it is done, how big the files are, etc.
- I will use managed tables and set up table partitioning
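To make scenario #2 a bit more concrete, this is roughly the kind of table I have in mind (the schema and names here are made up, so please treat this only as an illustration of the idea, not my actual definition):

    // Scenario #2: a managed table where partitioning keeps one client's data together
    CREATE TABLE IF NOT EXISTS dbo.ClientEvents
    (
        ClientId int,
        EventDate DateTime,
        Payload string,
        INDEX cidx CLUSTERED (ClientId, EventDate)
        PARTITIONED BY (ClientId)         // one partition per client
        DISTRIBUTED BY HASH (EventDate)   // distribution inside each partition
    );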
I am now looking into pros and cons of both scenarios. Some things that come to my mind are:
- The ability to compress data in scenario #1 (at the cost of performance of course)
- The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
- On the other hand, using tables is much more convenient: I will be dealing with just one data source and will not have to worry about extracting the correct files, only about applying the correct filters in a query (see the sketch right after this list) - in theory U-SQL should do the rest
- I would expect that the tables will offer better performance
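To illustrate the point about filters, a single-client report in scenario #2 should (if I understand it correctly) boil down to something like this (names made up, matching the sketch above):

    // One report = one client: just filter, and in theory U-SQL reads only the relevant partition
    @report =
        SELECT EventDate, Payload
        FROM dbo.ClientEvents
        WHERE ClientId == 42;

    OUTPUT @report
    TO "/reports/client42.csv"
    USING Outputters.Csv();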
One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):
First we can read that:
"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?
Later we can read that:
"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."
Now this makes things more complicated. If I read it correctly, this means that if I never rebuild a table, my data will be stored physically in exactly the same layout as the original raw files, and I will thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with the partitions because the GUIDs were different (the .ss files in the store had different GUIDs than the partitions in the U-SQL views), so this is just my guess.
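For context, the kind of daily append I was experimenting with is essentially this (paths are placeholders, table as sketched earlier):

    // Daily increment loaded with INSERT - per the quoted docs this adds an extra file per partition
    @increment =
        EXTRACT ClientId int,
                EventDate DateTime,
                Payload string
        FROM "/raw/2017/03/01/increment.csv"
        USING Extractors.Csv();

    INSERT INTO dbo.ClientEvents
    SELECT ClientId, EventDate, Payload
    FROM @increment;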
Therefore I have several questions:
- Is there some documentation explaining in more detail how TABLE REBUILD works?
- What is the performance of TABLE REBUILD? Will it work better than my idea of appending just the files that need to be appended (extract -> union all -> output, sketched below after this list)?
- How can I monitor the size of my partitions? In my case (running locally, I have not checked it online yet) the GUIDs of the files in the store and of the partitions do not match even after a REBUILD (they do match for the DB, schema and table)
- Is there any documentation explaining in more detail how .ss files are created?
- Which of the scenarios would you take and why?
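To clarify the second question, the file-based append I have in mind is roughly this (just a sketch with made-up paths):

    // Scenario #1 append: read the existing per-client file plus the new increment,
    // union them and write a single consolidated file back out
    @existing =
        EXTRACT ClientId int, EventDate DateTime, Payload string
        FROM "/data/client42/history.csv"
        USING Extractors.Csv();

    @increment =
        EXTRACT ClientId int, EventDate DateTime, Payload string
        FROM "/raw/2017/03/01/increment.csv"
        USING Extractors.Csv();

    @merged =
        SELECT * FROM @existing
        UNION ALL
        SELECT * FROM @increment;

    OUTPUT @merged
    TO "/data/client42/history_new.csv"
    USING Outputters.Csv();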
Many thanks for your help,
Jakub
EDIT: I did some more tests and they only made things more intriguing.
- I took a sample of 7 days of data
- I created a table partitioned by date (the rough setup is sketched at the end of this edit)
- I created 8 partitions - one for each day + one default
- I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions
- I imported the same file once again - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)
- To be triple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)
- I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense
- I imported the file once again - ended up with 16 files (the sizes don't match, so I guess I got 8 files for the existing partitions and 8 files for the new import - 1 per partition)
- I did the TABLE REBUILD - ended up again with 8 files, their sizes still growing - still makes sense, but... this is where it gets funny
- I then imported another file containing only 2 days of data. I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with the new import for the 2 days, and 6 very small files
- Being even more intrigued I ran the TABLE REBUILD
- I ended up with 8 files (one for each partition) but... they had all just been modified
Conclusion? If I am not mistaken, it looks like the rebuild will actually touch all of my files no matter what I just inserted. If that is the case, the whole scenario will become more and more expensive over time as the data grows. Could anyone please explain where I am wrong?
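For reference, the setup I used in the steps above looked roughly like this (simplified; the real schema, paths and dates differ):

    // Table partitioned by date
    CREATE TABLE IF NOT EXISTS dbo.DailyData
    (
        EventDate DateTime,
        ClientId int,
        Value double,
        INDEX cidx CLUSTERED (ClientId)
        PARTITIONED BY (EventDate)
        DISTRIBUTED BY HASH (ClientId)
    );

    // One partition per day (repeated for the remaining days) plus one extra used as a default
    DECLARE @day1 DateTime = new DateTime(2017, 3, 1);
    ALTER TABLE dbo.DailyData ADD IF NOT EXISTS PARTITION (@day1);

    // Each import round was essentially this
    @rows =
        EXTRACT EventDate DateTime, ClientId int, Value double
        FROM "/raw/sample7days.csv"
        USING Extractors.Csv();

    INSERT INTO dbo.DailyData
    SELECT EventDate, ClientId, Value
    FROM @rows;

    // Followed (in some rounds) by
    ALTER TABLE dbo.DailyData REBUILD;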