I have a TSV file in Azure Data Lake with the following fields:

paperId, language_code

I need to produce a file with the following fields:

language_id, language_code

where language_id is a unique ID generated for each language code.

To do this I wrote a UDO (user-defined operator), following the article at https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-develop-user-defined-operators.

using Microsoft.Analytics.Interfaces;
using System;
using System.Collections.Generic;

namespace USQL_UDO
{
    public class LanguageCode : IProcessor
    {
        // Maps each language code to the id generated for it.
        private static IDictionary<string, string> languageCodeID = new Dictionary<string, string>();

        public override IRow Process(IRow input, IUpdatableRow output)
        {
            long paperId = input.Get<long>("PaperId"); // read but not written to the output
            string languageCode = input.Get<string>("LanguageCode");
            string languageId;

            // Reuse the id if this language code was seen before; otherwise generate one.
            if (!languageCodeID.TryGetValue(languageCode, out languageId))
            {
                languageId = GetTimestamp(DateTime.Now);
                languageCodeID[languageCode] = languageId;
            }

            output.Set<string>(0, languageId);
            output.Set<string>(1, languageCode);

            return output.AsReadOnly();
        }

        // Formats a timestamp with millisecond precision, e.g. 20180413153045123.
        public static string GetTimestamp(DateTime value)
        {
            return value.ToString("yyyyMMddHHmmssfff");
        }
    }
}

But I cannot figure out how to reference this class from my U-SQL script. I cannot use Visual Studio because I'm working in a Linux environment. Is there a way to reference the custom class in a U-SQL query?

I'm very new to U-SQL and Azure, so I might be going about this in a completely nonsensical way.

My U-SQL script is this:

@inputA =
    EXTRACT PaperId long,
            LanguageCode string
    FROM "/graph/2018-04-13/PaperLanguages.txt"
    USING Extractors.Tsv(quoting : false);

@parsed_language =
     PROCESS @inputA
     PRODUCE Language_id string,
             LanguageCode string
     USING new USQL_UDO.LanguageCode();


OUTPUT @parsed_language
     TO "/output/parsedData/mag2__language.csv"
     USING Outputters.Text(outputHeader : true, quoting : false, delimiter: '~');
1 Answer

Could you use the Azure Data Lake tooling for VS Code from Linux instead?

Failing that, you can compile your code, upload the DLL to your Azure Data Lake Store or Azure Storage account, and register it with CREATE ASSEMBLY. Then, in your U-SQL script, you bring in your code with a REFERENCE ASSEMBLY statement.
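As a rough sketch of those two steps applied to the script in the question: the DLL path /assemblies/USQL_UDO.dll, the assembly name USQL_UDO, and the choice of the master database are all assumptions for illustration, so substitute wherever you actually uploaded the compiled file. The registration part only needs to run once and can be submitted as its own job.

```usql
// One-time registration. Assumes the compiled DLL was uploaded to the
// hypothetical path /assemblies/USQL_UDO.dll in the Data Lake Store.
USE DATABASE master;
CREATE ASSEMBLY IF NOT EXISTS [USQL_UDO] FROM "/assemblies/USQL_UDO.dll";

// In the script that uses the processor, reference the registered assembly
// before the first statement that needs it.
REFERENCE ASSEMBLY master.[USQL_UDO];

@inputA =
    EXTRACT PaperId long,
            LanguageCode string
    FROM "/graph/2018-04-13/PaperLanguages.txt"
    USING Extractors.Tsv(quoting : false);

// The class is addressed by its C# namespace and class name.
@parsed_language =
    PROCESS @inputA
    PRODUCE Language_id string,
            LanguageCode string
    USING new USQL_UDO.LanguageCode();

OUTPUT @parsed_language
    TO "/output/parsedData/mag2__language.csv"
    USING Outputters.Text(outputHeader : true, quoting : false, delimiter : '~');
```

Once the assembly is registered in the catalog, any later script can reuse it with just the REFERENCE ASSEMBLY line.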

Some examples are here: https://blogs.msdn.microsoft.com/azuredatalake/2016/08/26/how-to-register-u-sql-assemblies-in-your-u-sql-catalog/