Data Lake Analytics - Large vertex query

Question

I have a simple query which make a GROUP BY using two fields:

@facturas =
    SELECT a.CodFactura,
       Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,           
      SUM(a.Consumo) AS Consumo
    FROM @table_facturas AS a
    GROUP BY a.CodFactura, a.DateKey;

@table_facturas has 4100 rows but query takes several minutes to finish. Seeing the graph explorer I see it uses 2500 vertices because I'm having 2500 CodFactura+DateKey unique rows. I don't know if it normal ADAL behaviour. Is there any way to reduce the vertices number and execute this query faster?

Michael Rys Michael Rys · Accepted Answer · 2017-11-22T22:36:01

First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.

Secondly: In order to answer your question, we would need to know how the full query is defined. Where does @table_facturas come from? How was it produced?

Without this information, I can only give some wild speculative guesses:

If @table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
- you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have uptodate statistics (run CREATE STATISTICS on the columns).
- you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.

You can also try to hint a small number of rows in your query that creates @table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

Data Lake Analytics - Large vertex query

1 Answers