
It seems like I'm hitting a 1GB upper-boundary on my U-SQL input file size. Is there such a limit, and if so, how can this be increased?

Here's my case in a nutshell:

I'm working on a custom XML extractor that processes XML files of roughly 2.5 GB. These XML files conform to well-maintained XSD schemas. Using xsd.exe, I've generated .NET classes for XML serialization. The custom extractor uses these deserialized .NET objects to populate the output rows.

This all works neatly when running U-SQL against my local ADLA account from Visual Studio. Memory usage goes up to approx. 3 GB for a 2.5 GB input XML, so this should fit comfortably on a single vertex per file. It also still works with input files smaller than 1 GB on the Data Lake. However, when scaling up on the Data Lake Store, the job appears to be terminated on hitting the 1 GB input file size boundary.

I know that streaming the outer XML and then deserializing the inner XML fragments is an alternative option, but we don't want to create - and particularly maintain - too much custom code depending on those externally managed schemas. Therefore, raising the upper limit would be great.

Total Data Read Upper Boundary?


1 Answer


I see two issues right now. One that we can address, and one for which we have a feature under development for later this year.

  1. By default, U-SQL assumes that you want to scale out processing over your file and will split it into 1 GB "chunks" for extraction. If your extractor needs to see all the data (e.g., in order to parse XML, JSON, or an image), you need to mark the extractor to process files atomically (without splitting them) in the following way:

    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class MyExtractor : IExtractor
    { ...
    
  2. Now, while a vertex has 3 GB of memory, we currently limit the memory size for a UDO such as an extractor to 500 MB. So if you process your XML in a way that requires a lot of memory, you will currently still fail with a System.OutOfMemoryException. We are working on adding annotations to the UDOs that let you specify your memory requirements to override the default, but that is still under development at this point. The only ways to address that are to either make your data small enough or, in the case of XML for example, use a streaming parsing strategy that does not allocate too much memory (e.g., use XmlReader).
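
To illustrate the streaming strategy from point 2, here is a minimal sketch in plain .NET. It is not taken from the question's code: the `Item` class and the `Items`/`Item` element names are hypothetical stand-ins for a record type generated by xsd.exe. The idea is to walk the outer document with `XmlReader` and hand only one repeating element at a time to `XmlSerializer`, so that memory use is bounded by the largest single record rather than by the whole file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical record type; in practice this would be one of the
// classes generated by xsd.exe for the repeating inner element.
[XmlRoot("Item")]
public class Item
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class StreamingXmlDemo
{
    // Streams through the document and yields one deserialized record at a time.
    public static IEnumerable<Item> ReadItems(Stream input)
    {
        var serializer = new XmlSerializer(typeof(Item));
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (var reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
                {
                    // Deserialize consumes the whole <Item> element and leaves
                    // the reader on the node that follows it, so no extra
                    // Read() is needed here.
                    yield return (Item)serializer.Deserialize(reader);
                }
                else
                {
                    // Skip over anything that is not a record (root, end tags).
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Small in-memory sample; a real input would be a file stream.
        const string xml =
            "<Items>" +
            "<Item><Id>1</Id><Name>alpha</Name></Item>" +
            "<Item><Id>2</Id><Name>beta</Name></Item>" +
            "</Items>";

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            foreach (var item in ReadItems(stream))
            {
                Console.WriteLine(item.Id + ": " + item.Name);
            }
        }
    }
}
```

Inside an extractor marked with AtomicFileProcessing = true, the same loop would presumably run over the extractor's input stream and emit one output row per record instead of writing to the console. This keeps the xsd.exe-generated classes for the inner fragments, so only the thin outer loop depends on the document's top-level structure.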