
What are the guidelines, or where can we find guidelines, for designing a system for optimal parallelism? I understand that the data is split out across the individual nodes and that the storage is optimized for this.

The data that I currently have in files contains multiple customers, sites, products, and users. I need to aggregate by customer, site, and product, which means that subsets of that data can easily be calculated on individual nodes and then brought back to a single node for output at the end of processing.

However, I am not seeing that level of parallelism in the job graph. It is showing MDOP, but not in a way that seems optimal. I have 4 different calculations that are done independently per customer, site, and product. It is parallelizing the 4 calculations, but running each of them over the entire dataset. In reality, it should be able to fan the work out so that, say, 10 nodes each get 1 customer, and then each of those nodes could fan its calculations out to 4 more nodes. (The numbers here are just for example; the scale of the data is much larger.)
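For reference, the shape of my script is roughly like this (the file paths, column names, and specific aggregates are simplified placeholders, not my real logic):

```
@data =
    EXTRACT CustomerId int,
            SiteId int,
            ProductId int,
            Quantity int,
            Amount double
    FROM "/input/data.csv"
    USING Extractors.Csv();

// The 4 calculations are independent aggregates over the same grouping key.
@result =
    SELECT CustomerId,
           SiteId,
           ProductId,
           SUM(Quantity) AS TotalQuantity,
           SUM(Amount) AS TotalAmount,
           AVG(Amount) AS AvgAmount,
           COUNT(*) AS RecordCount
    FROM @data
    GROUP BY CustomerId, SiteId, ProductId;

OUTPUT @result
TO "/output/aggregates.csv"
USING Outputters.Csv();
```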

How do I optimize either the layout of the files or the U-SQL logic to encourage more MDOP?


1 Answer


Is your data coming from unstructured files or from U-SQL tables? And how much data are you processing? (You need more than 250 MB in a non-partitioned file to get parallelism.)

If the data comes from files, you may be able to partition the files and use file sets and ROWCOUNT hints to get better parallelism.
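For example, if the files are laid out per customer, a file set with a virtual column lets one EXTRACT pick up many files, each of which can then be processed by its own vertex. A minimal sketch (the path pattern, names, and the row count are illustrative; the ROWCOUNT hint is only useful when the optimizer underestimates the data size):

```
// {CustomerKey} is a virtual column bound from the file path;
// each matching file can be extracted in parallel.
@data =
    EXTRACT CustomerKey string,
            SiteId int,
            ProductId int,
            Amount double
    FROM "/input/customers/{CustomerKey}/data.csv"
    USING Extractors.Csv();

// Hint the optimizer that this rowset is large so it plans more vertices.
@big =
    SELECT *
    FROM @data
    OPTION(ROWCOUNT = 10000000);
```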

If the data is stored in U-SQL tables, you can use table partitioning, hash distribution, and clustered indexes to influence the parallelization.
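For example, a table distributed on the aggregation keys lets each distribution bucket be aggregated by its own vertex without a full reshuffle. A sketch (the names, partition column, and bucket count are illustrative; check the CREATE TABLE documentation for the exact syntax on your runtime version):

```
CREATE TABLE IF NOT EXISTS dbo.FactSales
(
    CustomerId int,
    SiteId int,
    ProductId int,
    Amount double,
    EventDate DateTime,
    // Cluster and hash-distribute on the grouping keys so a
    // GROUP BY CustomerId, SiteId, ProductId can run per bucket.
    INDEX cl_idx CLUSTERED(CustomerId, SiteId, ProductId ASC)
    PARTITIONED BY (EventDate)
    DISTRIBUTED BY HASH(CustomerId, SiteId, ProductId) INTO 20
);
```

Note that with a partitioned table you also have to add the partitions (ALTER TABLE ... ADD PARTITION) before loading data; predicates on the partition column then enable partition elimination.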

Also, at what level are you looking at the parallelization? Note that the job graph shows super vertices (SVs), which just group the vertices that do the same work. Each vertex within a super vertex (if there is enough data) will execute in parallel on a slice of the data, and, if possible, its output will be passed on to other vertices with minimal shuffling.