1
votes

Within my Azure Data Lake directory ingest, I have numbered files:

ingest/file1.tsv
ingest/file2.tsv
ingest/file3.tsv

Within an Azure Data Factory v2 U-SQL activity, how can I run Extractors.Tsv() on only the highest numbered file in the FileSet?

1

1 Answers

2
votes

One approach would require two scripts:

  1. Script 1 would get the max value using the file set feature, and then create the second script as a file (that you store in a WASB account, unless ADF allows you to refer to scripts in ADLS now as well).

  2. Then execute Script 2.

Another approach would be to get the highest number using the ADLS SDK and then pass that value to a script that uses a file set on the number as in:

DECLARE EXTERNAL @fileno int;
@d = EXTRACT ..., fileno int 
     FROM "ingest/file{fileno}.tsv" 
     USING Extractors.Tsv();
// predicate will be pushed into EXTRACT statement.
@d = SELECT * FROM @d WHERE fileno == @fileno;

The syntactically simplest solution in a single script would be the following:

@d = EXTRACT ..., fileno int 
     FROM "ingest/file{fileno}.tsv" 
     USING Extractors.Tsv();

@maxfno = SELECT DISTINCT(MAX(fileno)) AS maxno FROM @d;

@d = SELECT * FROM @d JOIN @maxfno ON fileno == maxno;

But that statement is not constant-foldable and thus you would read all files anyway.