1 vote

I have a big blob storage account full of log files, organized by identifier at several levels: repository, branch, build number, build step number.

These are JSON files containing an array of objects; each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.

Initial load

Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:

@entries =
  EXTRACT
    repo string,
    branch string,
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();

When I run this extraction query, I get a compiler error: it exceeds the 25-minute compilation time limit. My guess is that there are too many files, so I added a WHERE clause to the INSERT INTO query:

INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");

Still no luck - the compiler times out.

(It does work, however, when I process a single build, leaving {step} as the only wildcard and hard-coding the rest of the names.)

Question: Is there a way to perform a load like this as a number of jobs, but without having to explicitly (manually) "partition" the list of input files?

Incremental load

Let's assume for a moment that I succeeded in loading those files. A few days from now, however, I'll need to perform an update - how am I supposed to specify the list of files then? I have a SQL Server database where all the metadata is kept, and I could extract the exact log file paths from it, but U-SQL's EXTRACT expression forces me to provide a static string that specifies the input data.

A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.

Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query against a SQL Server database with the extraction query?


2 Answers

1 vote

For your first question: it sounds like your FileSet pattern is generating a very large number of input files. To deal with that, you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in: https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md

Input File Set scales orders of magnitude better (an opt-in statement is now provided)

Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile-time time-outs at around 800 to 5,000 files.

U-SQL's file set pattern now scales to many more files and generates more efficient plans.

For example, a U-SQL script querying over 2,500 files in our telemetry system previously took over 10 minutes to compile; it now compiles in 1 minute, and the script executes in 9 minutes instead of over 35 minutes, using far fewer AUs. We have also compiled scripts that access 30,000 files.

The preview feature can be turned on by adding the following statement to your script:

SET @@FeaturePreviews = "FileSetV2Dot5:on";
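
As a rough sketch (reusing the EXTRACT from your question as-is), the opt-in simply goes at the top of the script, before the expression that uses the file set pattern:

SET @@FeaturePreviews = "FileSetV2Dot5:on";

// Same extraction as in the question; the preview only changes how the
// compiler handles the large file set, not the syntax of the script.
@entries =
  EXTRACT
    repo string,
    branch string,
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();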

If you want to generate multiple EXTRACT statements based on partitions of your file paths, you would have to do it with some external code that generates one or more U-SQL scripts.
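
For illustration, one such generated script might look like the sketch below, where the partition values are baked into the path so that only {build} and {step} remain as wildcards; "myRepo" and "master" stand in for whatever partition the generator is currently emitting:

@entries =
  EXTRACT
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/myRepo/master/{build}/{step}.json"
  USING new MyJSONExtractor();

// repo and branch are no longer virtual columns here, so the generated script
// supplies them as constants when inserting into the table.
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT "myRepo" AS Repo,
       "master" AS Branch,
       build,
       step,
       Line,
       Entry
FROM @entries;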

I don't have a good answer to your second question, so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.

0 votes

To address your second question:

You could read your data from the SQL Server database with a federated query and then use that information in a join against the virtual columns that the fileset gives you. The problem with that is that the values are only known at execution time, not at compile time, so you would not get the reduction in the number of files that are accessed.
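
A rough sketch of that approach, assuming a data source (here called MyAzureSqlDb) has already been registered in the U-SQL catalog against your SQL Server database, and assuming a hypothetical dbo.BuildLogs table that lists the builds you want to load:

// @entries is extracted with the file set pattern exactly as in the question.
// MyAzureSqlDb and dbo.BuildLogs are placeholder names for this sketch.
@buildsToLoad =
  SELECT Repo, Branch, Build
  FROM EXTERNAL MyAzureSqlDb LOCATION "dbo.BuildLogs";

// Join on the virtual columns. Because the values from SQL Server are only
// known at execution time, the compiler cannot eliminate files up front;
// every file behind the pattern is still enumerated.
@filtered =
  SELECT e.repo, e.branch, e.build, e.step, e.Line, e.Entry
  FROM @entries AS e
  INNER JOIN @buildsToLoad AS b
    ON e.repo == b.Repo AND e.branch == b.Branch AND e.build == b.Build;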

Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script, so that you can pass that information into the script when you submit it.
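
A sketch of what that could look like; the defaults below are just placeholders, and the real values would be supplied by whatever submits the job (tools that support U-SQL parameters inject plain DECLARE statements that override the DECLARE EXTERNAL defaults):

// Placeholder defaults; a DECLARE with the same name prepended by the
// submitting tool overrides a DECLARE EXTERNAL without causing a conflict.
DECLARE EXTERNAL @repo string = "myRepo";
DECLARE EXTERNAL @branch string = "master";

@entries =
  EXTRACT repo string, branch string, build int, step int, Line int, Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();

// Since @repo and @branch are known at compile time, the predicate on the
// virtual columns can help cut down the set of files that is actually read.
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == @repo) AND (branch == @branch);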

As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request and adding a comment that you also want to query on file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr