I'm processing the table "Content" with the user-defined function "TransformData":

@result = 
SELECT Id,
       TransformData(Data) AS TransformedData
FROM Content;

The table "Content" is big (about 100M records) and the "TransformData" function is slow: it is very complex and takes about 20 milliseconds per record.

Azure Data Lake splits my query into 25 vertices by default. That's not enough: 100M records at ~20 ms each is roughly 556 hours of total compute, so on 25 vertices the job may take around 22 hours. I would like to allocate at least 200 AUs for this job and finish it as fast as I can. As far as I understand, it's useless to allocate more than 25 AUs for this query as long as it only splits into 25 vertices.

Can I somehow increase the parallelism of my query? Any option is acceptable.

1 Answer

You have three options to increase the parallelism of your job.

The first one is to increase the number of distribution buckets of your Content table (assuming you are using a table, which your statement about 25 vertices seems to indicate).

If you are using HASH distribution, you can specify a larger number by adding the INTO x clause, where x is the number of buckets and should be less than about 2,000.
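
For illustration, a table declared along the following lines would be split into 200 hash buckets, letting reads against it run on up to 200 vertices. This is only a sketch: the Id/Data schema is assumed from the question, and changing the bucket count of an existing table generally means recreating it and reloading the data.

CREATE TABLE dbo.Content
(
    Id int,
    Data string,
    // The INTO clause sets the number of hash distribution buckets.
    INDEX idx CLUSTERED (Id ASC)
    DISTRIBUTED BY HASH (Id) INTO 200
);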

The other two options are similar: use hints to trick the system into using more nodes.

One is to use a data hint on a previous result to claim that it returns a really large number of rows. This will trick the system into re-partitioning the data and will scale subsequent processing out (adjust the number to something you are happy with):

@content = SELECT * FROM Content OPTION(ROWCOUNT=1000000000);
@result = SELECT Id, TransformData(Data) AS TransformedData FROM @content;

The other is to change the partitioning explicitly:

@content =
   SELECT *
   FROM Content
   OPTION(PARTITION(Id)=(PARTITIONCOUNT=100));
@result = SELECT Id, TransformData(Data) AS TransformedData FROM @content;
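
As a back-of-the-envelope check against the numbers in the question: PARTITIONCOUNT=100 still puts about 1M rows on each vertex, which at 20 ms per row is roughly 5.5 hours of UDF time per vertex. Something like PARTITIONCOUNT=1000 cuts that to about 100K rows (~33 minutes) per vertex, which 200 AUs could then work through in under 3 hours.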

Both of these approaches, however, give you a fixed allocation that will over-allocate if your data turns out smaller and under-allocate if it is much larger.