I have a large data set that spans a couple hundred files. It has a few encoding issues (it's mostly UTF-8, but some characters just aren't valid). According to https://msdn.microsoft.com/en-us/library/azure/mt764098.aspx an encoding error causes a runtime error even when the silent flag is set to true (which I had hoped would just skip the erroring rows).
As a result, I need to write a custom extractor. I've written one that is largely a simplified version of the example at https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data-lake-analytics/ in that it takes a row, splits it by the delimiter and returns the values within a try block. If there are any exceptions, I handle them and move on.
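For illustration, the split-inside-a-try-block idea can be sketched roughly as below. This is not the extractor's actual code: RowSplitter and SplitRows are hypothetical names, and the sketch assumes that rows arrive as raw byte arrays and that a strict UTF8Encoding is used, so that invalid bytes raise an exception rather than being silently replaced.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, independent of the U-SQL interfaces.
public static class RowSplitter
{
    // throwOnInvalidBytes: true makes GetString throw DecoderFallbackException
    // on invalid UTF-8 instead of substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Decodes each raw row and splits it by the column delimiter,
    // skipping any row that is not valid UTF-8.
    public static IEnumerable<string[]> SplitRows(IEnumerable<byte[]> rawRows, char colDelim)
    {
        foreach (var raw in rawRows)
        {
            string[] parts;
            try
            {
                parts = StrictUtf8.GetString(raw).Split(colDelim);
            }
            catch (DecoderFallbackException)
            {
                continue; // skip the offending row and move on
            }

            // C# does not allow yield return inside a try block that has
            // a catch clause, so the row is yielded after the try/catch.
            yield return parts;
        }
    }
}
```

Note that Encoding.UTF8 by default replaces invalid bytes rather than throwing, so a try block around a decode with that encoding would not see a decoding exception.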
Unfortunately, I'm having an issue actually referencing this extractor in the USQL script itself. The guidance at the above link suggests writing the logic in another assembly, building that, registering it in the ADLS database/assemblies and then including it via REFERENCE ASSEMBLY MyExtractors; at the top of the script (as that is the namespace used). In the USING statement below it, I call it with USING new SimpleExtractor(); but when I do so, I get an error when running the script against the ADLS service that the type or namespace cannot be found. Additionally, if I attempt to be more precise and use USING new MyExtractors.SimpleExtractor(); in the USING statement, it yields the same error, citing the USING statement above.
I then found additional documentation in an older source at https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/ that describes doing the same thing but in the code-behind file. I deleted the separate assembly and copied the logic into a class in that file. The example in step #6 doesn't show any REFERENCE ASSEMBLY statements, but again, when I run it, I get an error that the type or namespace name cannot be found.
Looking at the most recent release notes in hopes that something is just out of date here, the only thing I see is that if I use a USING statement, I need a reference to the custom code's assembly (as in the first attempt) prior to actually using it, which I have.
Can anyone please provide some guidance on how to properly reference UDOs in USQL, or otherwise indicate how to have the runtime handle Encoding exceptions silently (and just skip them)?
Here's what my logic is looking like in the extractor itself:
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;

    namespace Utilities
    {
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class ModifiedTextExtractor : IExtractor
        {
            //Contains the row
            private readonly Encoding _encoding;
            private readonly byte[] _row_delim;
            private readonly char _col_delim;

            public ModifiedTextExtractor()
            {
                _encoding = Encoding.UTF8;
                _row_delim = _encoding.GetBytes("\r\n");
                _col_delim = '\t';
            }

            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                //Read the input line by line
                foreach (var current in input.Split(_row_delim))
                {
                    using (var reader = new StreamReader(current, this._encoding))
                    {
                        var line = reader.ReadToEnd().Trim();
                        //If there are any single or double quotes in the line, escape them
                        line = line.Replace(@"""", @"\""");
                        var count = 0;
                        //Split the input by the column delimiter
                        var parts = line.Split(_col_delim);
                        foreach (var part in parts)
                        {
                            output.Set<string>(count, part);
                            count += 1;
                        }
                    }
                    yield return output.AsReadOnly();
                }
            }
        }
    }
And a snippet of how I'm trying to use it in the USQL statement (after registering it as an assembly):
REFERENCE ASSEMBLY [Utilities];
CREATE VIEW MyView AS ...
USING new Utilities.ModifiedTextExtractor();
Thank you!