
I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in "Atomic" mode:

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor

If I switch off "Atomic" mode, it looks like U-SQL splits the file at an arbitrary position (I guess into roughly 250 MB chunks). That is not acceptable for this format, because it has a special row delimiter. Can I define a custom row delimiter in my Extractor and still enable parallelism? Technically, I could change our row delimiter to a new one if that would help. Could anyone help me with this question?
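
For clarity, this is roughly the shape I am hoping for; the constructor parameter and the delimiter value are placeholders, not something I have working:

using System.Collections.Generic;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class BinaryExtractor : IExtractor
{
    private readonly byte[] _rowDelim;

    // Hypothetical constructor: the row delimiter would be passed in
    // from the U-SQL script so the extractor knows where rows end.
    public BinaryExtractor(string rowDelim)
    {
        _rowDelim = Encoding.UTF8.GetBytes(rowDelim);
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // ... split the input into rows on _rowDelim and yield them ...
        yield break;
    }
}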


1 Answer


The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation-defined and may change for performance reasons).

If the file is indeed row-delimited, and assuming the raw input data for a single row is less than 4 MB, you can use the input.Split() function inside your UDO to split the stream into rows. The call automatically handles the case where a row spans a chunk boundary (again, as long as the row is less than 4 MB).

Here is an example:

public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // this._row_delim = this._encoding.GetBytes(row_delim); in the class ctor
    // input.Split() yields one Stream per row, even when a row spans a chunk boundary.
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            // Split the row into its columns on the column delimiter.
            string[] array = streamReader.ReadToEnd().Split(new string[] { this._col_delim }, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++)
            {
                // DO YOUR PROCESSING: convert array[i] and write it to the output row,
                // e.g. outputrow.Set<string>(i, array[i]);
            }
        }
        yield return outputrow.AsReadOnly();
    }
}

Please note that you cannot read across chunk boundaries yourself, so you should make sure your data really is splittable into rows.
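
For completeness, here is a minimal sketch of how such an extractor would typically be invoked from a U-SQL script. The namespace, schema, paths, and delimiter values are placeholders; the constructor arguments assume a ctor that takes the row and column delimiters, as hinted at in the comment above.

// Extract rows using the custom extractor; parallelism is possible because
// AtomicFileProcessing is not set to true.
@rows =
    EXTRACT id int,
            payload string
    FROM "/input/data.bin"
    USING new MyCode.BinaryExtractor("\u0001", "\u0002");

OUTPUT @rows
TO "/output/rows.csv"
USING Outputters.Csv();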