1 vote

I have a big blob storage account full of log files, organized by identifier at several levels: repository, branch, build number, build step number.

These are JSON files containing an array of objects; each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.

Initial load

Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:

@entries =
  EXTRACT
    repo string,
    branch string,
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();

When I run this extraction query, I get a compiler error: it exceeds the 25-minute compilation time limit. My guess is that there are too many files, so I added a WHERE clause to the INSERT INTO query:

INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");

Still no luck - the compiler times out.

(It does work, however, when I process a single build, leaving {step} as the only wildcard and hard-coding the rest of the names.)

Question: Is there a way to perform a load like this as a number of jobs, but without having to explicitly (manually) "partition" the list of input files?

Incremental load

Let's assume for a moment that I succeeded in loading those files. A few days from now, however, I'll need to perform an update - how am I supposed to specify the list of files then? I have a SQL Server database where all the metadata is kept, and I could extract the exact log file paths from it, but U-SQL's EXTRACT expression forces me to provide a static string that specifies the input data.

A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.

Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query against a SQL Server database with the extraction query?


2 Answers

1 vote

For your first question: it sounds like your FileSet pattern is generating a very large number of input files. To deal with that, you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in: https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md

Input File Set scales orders of magnitude better (an opt-in statement is now provided)

Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile-time time-outs at around 800 to 5,000 files.

U-SQL's file set pattern now scales to many more files and generates more efficient plans.

For example, a U-SQL script querying over 2,500 files in our telemetry system previously took over 10 minutes to compile; it now compiles in 1 minute, and the script executes in 9 minutes instead of over 35 minutes, using far fewer AUs. We have also compiled scripts that access 30,000 files.

The preview feature can be turned on by adding the following statement to your script:

SET @@FeaturePreviews = "FileSetV2Dot5:on";
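
As a rough sketch (reusing the EXTRACT from your question as-is), the opt-in simply goes at the top of the script, before the expression that uses the file set pattern:

SET @@FeaturePreviews = "FileSetV2Dot5:on";

// Same extraction as in the question; the preview only changes how the
// compiler handles the large file set, not the syntax of the script.
@entries =
  EXTRACT
    repo string,
    branch string,
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();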

If you want to generate multiple EXTRACT statements based on partitions of your file paths, you would have to do it with some external code that generates one or more U-SQL scripts.
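
For illustration, one such generated script might look like the sketch below, where the partition values are baked into the path so that only {build} and {step} remain as wildcards; "myRepo" and "master" stand in for whatever partition the generator is currently emitting:

@entries =
  EXTRACT
    build int,
    step int,
    Line int,
    Entry string
  FROM @"wasb://my.blob.core.windows.net/myRepo/master/{build}/{step}.json"
  USING new MyJSONExtractor();

// repo and branch are no longer virtual columns here, so the generated script
// supplies them as constants when inserting into the table.
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT "myRepo" AS Repo,
       "master" AS Branch,
       build,
       step,
       Line,
       Entry
FROM @entries;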

I don't have a good answer to your second question, so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.

0 votes

To address your second question:

You could read your data from the SQL Server database with a federated query and then use that information in a join against the virtual columns that the fileset gives you. The problem with that is that the values are only known at execution time, not at compile time, so you would not get the reduction in the number of files that are accessed.
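
A rough sketch of that approach, assuming a data source (here called MyAzureSqlDb) has already been registered in the U-SQL catalog against your SQL Server database, and assuming a hypothetical dbo.BuildLogs table that lists the builds you want to load:

// @entries is extracted with the file set pattern exactly as in the question.
// MyAzureSqlDb and dbo.BuildLogs are placeholder names for this sketch.
@buildsToLoad =
  SELECT Repo, Branch, Build
  FROM EXTERNAL MyAzureSqlDb LOCATION "dbo.BuildLogs";

// Join on the virtual columns. Because the values from SQL Server are only
// known at execution time, the compiler cannot eliminate files up front;
// every file behind the pattern is still enumerated.
@filtered =
  SELECT e.repo, e.branch, e.build, e.step, e.Line, e.Entry
  FROM @entries AS e
  INNER JOIN @buildsToLoad AS b
    ON e.repo == b.Repo AND e.branch == b.Branch AND e.build == b.Build;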

Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script, so that you can pass that information into the script when you submit it.
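
A sketch of what that could look like; the defaults below are just placeholders, and the real values would be supplied by whatever submits the job (tools that support U-SQL parameters inject plain DECLARE statements that override the DECLARE EXTERNAL defaults):

// Placeholder defaults; a DECLARE with the same name prepended by the
// submitting tool overrides a DECLARE EXTERNAL without causing a conflict.
DECLARE EXTERNAL @repo string = "myRepo";
DECLARE EXTERNAL @branch string = "master";

@entries =
  EXTRACT repo string, branch string, build int, step int, Line int, Entry string
  FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
  USING new MyJSONExtractor();

// Since @repo and @branch are known at compile time, the predicate on the
// virtual columns can help cut down the set of files that is actually read.
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == @repo) AND (branch == @branch);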

As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request and adding a comment that you also want to query on file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr