2
votes

I have a large set of data that spans a couple hundred files. Apparently, it's got a few encoding issues in it (it's mostly UTF-8, but apparently some characters just aren't valid). According to https://msdn.microsoft.com/en-us/library/azure/mt764098.aspx if there is an encoding error, a runtime error will occur regardless of setting the silent flag to true (with the aim of just skipping erroring rows).

As a result, I need to write a custom extractor. I've written one that largely does a simplified version of the example at https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data-lake-analytics/ in that it just takes a row, splits it by the delimiter and just returns the values within a try block. If there are any exceptions, I just handle them and move on.

Unfortunately, I'm having an issue actually referencing this extractor in the USQL script itself. When I follow the guidance on the above link, it suggests writing the logic in another assembly, building that, registering it in the ADLS database/assemblies and then including it via REFERENCE ASSEMBLY MyExtractors; at the top of the script (as that is the namespace used). In the below Using statement, I call it with USING new SimpleExtractor(); If I do so, I get an error when running the script against the ADLS service that the type or namespace cannot be found. Additionally, if I attempt to be more precise and use USING new MyExtractors.SimpleExtractor(); in the using statement, it yields the same error, citing the USING statement above.

I then found additional documentation in an older source at https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/ that describes doing the same thing but in the code-behind file. I deleted the separate assembly and copied the logic into a class in that file. The example in step #6 doesn't show any REFERENCE ASSEMBLY statements, but again, when I run it, I get an error that the type or namespace name cannot be found.

Looking at the most recent release notes in hopes that something is just out of date here, the only thing I see is that if I use a USING statement , I need a reference to the custom code's assembly (as in the first attempt) prior to actually using it, which I am.

Can anyone please provide some guidance on how to properly reference UDOs in USQL, or otherwise indicate how to have the runtime handle Encoding exceptions silently (and just skip them)?

Here's what my logic is looking like in the extractor itself:

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

namespace Utilities
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class ModifiedTextExtractor : IExtractor
    {
        //Contains the row
        private readonly Encoding _encoding;
        private readonly byte[] _row_delim;
        private readonly char _col_delim;

        public ModifiedTextExtractor()
        {
            _encoding = Encoding.UTF8;
            _row_delim = _encoding.GetBytes("\r\n");
            _col_delim = '\t';
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            //Read the input line by line
            foreach (var current in input.Split(_row_delim))
            {
                using (var reader = new StreamReader(current, this._encoding))
                {
                    var line = reader.ReadToEnd().Trim();

                    //If there are any single or double quotes in the line, escape them
                    line = line.Replace(@"""", @"\""");

                    var count = 0;

                    //Split the input by the column delimiter
                    var parts = line.Split(_col_delim);

                    foreach (var part in parts)
                    {
                        output.Set<string>(count, part);
                        count += 1;
                    }
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

And a snippet of how I'm trying to use it in the USQL statement (after registering it as an assembly):

REFERENCE ASSEMBLY [Utilities];

CREATE VIEW MyView AS ...
USING new Utilities.ModifiedTextExtractor();

Thank you!

2
Dear Xaniff Can you show me your namespace/class name that you use in your UDO C# code? What I assume is that you either forgot to mark the extractor as public static or you have a typo in your Extractor class name.Michael Rys
@MichaelRys Added the code I've got implemented to the question - thanks!Whit Waldo

2 Answers

2
votes

You are running into the problem that VIEWs cannot refer to custom code. In U-SQL, all objects need to contain their context specification (such as the referenced assemblies in their body (this makes the objects more self-contained and avoids the problem of pulling a potentially long line of dependencies unbeknownst to the user of the object).

What you need to do is turning the VIEW into a Table-Valued Function:

CREATE FUNCTION MyFunct(/* optional parameters */) RETURNS @res AS
BEGIN
  REFERENCE ASSEMBLY [Utilities];
  @res = EXTRACT ... USING new Utilities.ModifiedTextExtractor();
END;

And then call the function as follows (note you need to provide the rowset alias in the SELECT statement):

@data = SELECT ... FROM MyFunct() AS f WHERE ...;

or if you don't want to apply a projection or filter:

@data = MyFunct();

Like a view a table-valued function will be inlined.

0
votes

An alternative approach to this problem would be to use Azure SQL Data Warehouse and Polybase, which supports rejecting rows.

1) Create an external table in ADW on your external files:

CREATE EXTERNAL TABLE ext.mycsv (
    colA INT NOT NULL,
    colB INT NOT NULL,
    colC INT NOT NULL
)
WITH (
    DATA_SOURCE = eds_mycsv,
    LOCATION = N'/myblobstorage/',
    FILE_FORMAT = eff_csv,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
)

The external table can point at a sinlge file or directory (multiple files with the same structure as in my example). The reject_value of 1 will allow one row to fail without failing the entire job. This can also be a percentage, ie "allow 3% of rows to fail without failing the entire load. The statement will also give you information about the failed rows. Read more about REJECT_TYPE and REJECT_VALUE here.

The other great thing about ADW is that it can be paused when you are not using it.

2) Create an internal table in ADW to materialise it, eg

CREATE TABLE dbo.mycsv
WITH 
(   
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM ext.mycsv;

3) Create an external table within Azure Data Lake Analytics (ADLA) with U-SQL, using an external data source to "query data where it lives", ie in the warehouse.

// Create external table which is in SQL DW
CREATE EXTERNAL TABLE IF NOT EXISTS adlaExt.mycsv
(
    colA        int,
    colB        int,
    colC        int
)
FROM ds_adw LOCATION "dbo.mycsv";

4) Query your external table in U-SQL, eg:

// Query external table
@e =
    SELECT *
    FROM dbo.mycsv;


// Join with internal table
@q =
    SELECT a.*, b.someColumn
    FROM @e AS a
            INNER JOIN
                dbo.someOtherTable AS b
            ON a.colA == b.n_colA;


// Output it
OUTPUT @q TO "/output/output.csv"
USING Outputters.Csv();

Optionally import it to ADLA. There's a great blog post by Jorg Klein on the federated query setup in ADLA here.

IMHO this is much safer than creating a custom extactor as you are using native Azure components. Polybase does not yet support ADLA but almost certainly will do at some point in the future, at which point the design could be simplified.