What are the guidelines, or where can we find guidelines for designing a system for optimal parallelism. I understand that the data is split out across the individual nodes and optimized to do so.
The data that I have in files currently has multiple customer, sites, products and users in it. I need to aggregate by customer, site, product which means that subsets of that data can be easily calculated in individual nodes and the brought back to a single node for output at the end of processing.
However I am not seeing that level of parallelism in the job graph. It is showing MDOP but not in a way that seems optimal. I have 4 different calculations that are done independently on customer, site, product. It is paralleling the 4 calculations, but doing it on the entire dataset. When in reality it should be able to fan it out where say 10 nodes get a 1 customer each, then each of those nodes could fan its calculations out to 4 more nodes. (note numbers here simply for example, scale of data is much larger).
How do I optimize either layout of files or logic of U-SQL to encourage more MDOP?