Writing custom extractor in USQL to skip rows with encoding problems

Question

I have a large set of data that spans a couple hundred files. Apparently, it's got a few encoding issues in it (it's mostly UTF-8, but apparently some characters just aren't valid). According to https://msdn.microsoft.com/en-us/library/azure/mt764098.aspx if there is an encoding error, a runtime error will occur regardless of setting the silent flag to true (with the aim of just skipping erroring rows).

As a result, I need to write a custom extractor. I've written one that largely does a simplified version of the example at https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data-lake-analytics/ in that it just takes a row, splits it by the delimiter and just returns the values within a try block. If there are any exceptions, I just handle them and move on.

Unfortunately, I'm having an issue actually referencing this extractor in the USQL script itself. When I follow the guidance on the above link, it suggests writing the logic in another assembly, building that, registering it in the ADLS database/assemblies and then including it via REFERENCE ASSEMBLY MyExtractors; at the top of the script (as that is the namespace used). In the below Using statement, I call it with USING new SimpleExtractor(); If I do so, I get an error when running the script against the ADLS service that the type or namespace cannot be found. Additionally, if I attempt to be more precise and use USING new MyExtractors.SimpleExtractor(); in the using statement, it yields the same error, citing the USING statement above.

I then found additional documentation in an older source at https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/ that describes doing the same thing but in the code-behind file. I deleted the separate assembly and copied the logic into a class in that file. The example in step #6 doesn't show any REFERENCE ASSEMBLY statements, but again, when I run it, I get an error that the type or namespace name cannot be found.

Looking at the most recent release notes in hopes that something is just out of date here, the only thing I see is that if I use a USING statement , I need a reference to the custom code's assembly (as in the first attempt) prior to actually using it, which I am.

Can anyone please provide some guidance on how to properly reference UDOs in USQL, or otherwise indicate how to have the runtime handle Encoding exceptions silently (and just skip them)?

Here's what my logic is looking like in the extractor itself:

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

namespace Utilities
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class ModifiedTextExtractor : IExtractor
    {
        //Contains the row
        private readonly Encoding _encoding;
        private readonly byte[] _row_delim;
        private readonly char _col_delim;

        public ModifiedTextExtractor()
        {
            _encoding = Encoding.UTF8;
            _row_delim = _encoding.GetBytes("\r\n");
            _col_delim = '\t';
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            //Read the input line by line
            foreach (var current in input.Split(_row_delim))
            {
                using (var reader = new StreamReader(current, this._encoding))
                {
                    var line = reader.ReadToEnd().Trim();

                    //If there are any single or double quotes in the line, escape them
                    line = line.Replace(@"""", @"\""");

                    var count = 0;

                    //Split the input by the column delimiter
                    var parts = line.Split(_col_delim);

                    foreach (var part in parts)
                    {
                        output.Set<string>(count, part);
                        count += 1;
                    }
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

And a snippet of how I'm trying to use it in the USQL statement (after registering it as an assembly):

REFERENCE ASSEMBLY [Utilities];

CREATE VIEW MyView AS ...
USING new Utilities.ModifiedTextExtractor();

Thank you!

Dear Xaniff Can you show me your namespace/class name that you use in your UDO C# code? What I assume is that you either forgot to mark the extractor as public static or you have a typo in your Extractor class name. — Michael Rys
@MichaelRys Added the code I've got implemented to the question - thanks! — Whit Waldo

Michael Rys Michael Rys · Accepted Answer · 2016-11-11T18:35:23

You are running into the problem that VIEWs cannot refer to custom code. In U-SQL, all objects need to contain their context specification (such as the referenced assemblies in their body (this makes the objects more self-contained and avoids the problem of pulling a potentially long line of dependencies unbeknownst to the user of the object).

What you need to do is turning the VIEW into a Table-Valued Function:

CREATE FUNCTION MyFunct(/* optional parameters */) RETURNS @res AS
BEGIN
  REFERENCE ASSEMBLY [Utilities];
  @res = EXTRACT ... USING new Utilities.ModifiedTextExtractor();
END;

And then call the function as follows (note you need to provide the rowset alias in the SELECT statement):

@data = SELECT ... FROM MyFunct() AS f WHERE ...;

or if you don't want to apply a projection or filter:

@data = MyFunct();

Like a view a table-valued function will be inlined.

Writing custom extractor in USQL to skip rows with encoding problems

2 Answers