2 votes

I have a lot of Avro files with IPs (weblogs) stored in Azure Blobs. I would like to map each IP to a location. How can I do that with Azure Data Lake Analytics (ADLA)?

Right now I have a Spark job that does this lookup using the Maxmind IP database with a Java library that reads a 113 MB .mmdb file containing all the IP-location data. I am now investigating whether it's possible to move this job over to ADLA.

Maxmind provides a C# library as well, so that part is no problem. However, it's not obvious to me how to deal with this big .mmdb file, which needs to be read once and then used for the lookups. Obviously it would not be fast to re-read the file for each IP lookup. How can this (and similar cases) be handled with ADLA, or is ADLA unsuitable for these kinds of operations?

If I had a normal program running I would do the lookup like this:

// Reader is MaxMind.Db.Reader from the MaxMind-DB-Reader-dotnet package
using (var reader = new Reader("GeoIP2-City.mmdb"))
{
    foreach (var ip in ips)
    {
        // Returns the record for the IP as nested dictionaries
        var data = reader.Find<Dictionary<string, object>>(ip);
        ...
    }
}

The Maxmind DB is available here: https://dev.maxmind.com/geoip/geoip2/downloadable/ (note that I have bought the database I'm currently using), and the C# library to read it is here: https://github.com/maxmind/MaxMind-DB-Reader-dotnet

Honestly this sounds like a straightforward database operation, i.e. a join (a.k.a. a lookup), but I don't know anything about the Maxmind data. Can you tell us more? E.g. have you paid for and downloaded the file? – wBob

Expanded the question a bit with how the database would be used in a "normal" command-line program. – viblo

Have you considered using an ADLA table to store the large file, then using it as you would a lookup table? You can join your input file directly to an ADLA table based on the IP (assuming they have the same format), similar to natural-key lookups. – jatal

Not really. I think I would then need to reimplement the DB reader from MaxMind, and would also get very bad performance compared to the optimized lookup the mmdb file format provides. (However, I have already moved on to a normal WebJob reading from/writing to Event Hubs to do the mapping, so I have not explored more ADLA options for this case.) – viblo

1 Answer

1 vote

You can use U-SQL's DEPLOY RESOURCE statement and a user-defined operator (UDO) for this purpose.

First, upload the file to your Data Lake Store. Then use DEPLOY RESOURCE to tell the U-SQL system to copy that file to each vertex where the script runs. Finally, your script uses C# code to read the file. Here is a small HelloWorld example of the pattern; a sketch adapting it to the .mmdb lookup follows below.

// Copy helloworld.txt to the working directory of every vertex that runs the script.
DEPLOY RESOURCE "/helloworld.txt";

// A small inline rowset to demonstrate the processor.
@departments =
  SELECT *
  FROM (VALUES
      (31, "Sales"),
      (33, "Engineering"),
      (34, "Clerical"),
      (35, "Marketing")
    ) AS D( DepID, DepName );

// Run the custom processor over each row; it appends the file contents as a column.
@departments =
     PROCESS @departments
     PRODUCE DepID int,
             DepName string,
             HelloWorld string
     USING new Demo.HelloWorldProcessor();

OUTPUT @departments
    TO "/departments.tsv"
    USING Outputters.Tsv();

And here is the U-SQL Processor UDO.

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace Demo
{
    [SqlUserDefinedProcessor]
    public class HelloWorldProcessor : IProcessor
    {
        private string hw;

        public HelloWorldProcessor()
        {
            // The deployed resource is copied into the vertex's working directory,
            // so it can be opened by file name and read once per vertex here.
            this.hw = System.IO.File.ReadAllText("helloworld.txt");
        }

        public override IRow Process(IRow input, IUpdatableRow output)
        {
            output.Set<int>("DepID", input.Get<int>("DepID"));
            output.Set<string>("DepName", input.Get<string>("DepName"));
            output.Set<string>("HelloWorld", hw);
            return output.AsReadOnly();
        }
    }
}
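
To adapt this pattern to the Maxmind lookup, you would deploy the .mmdb file instead (DEPLOY RESOURCE "/GeoIP2-City.mmdb";), open it once per vertex in the processor, and call Find for each row, just like the snippet in the question. Below is a rough, untested sketch; the processor name GeoIpProcessor, the column names IP and Country, and the "country"/"iso_code" path into the lookup result are assumptions you would adjust to your own schema and database edition.

using MaxMind.Db;
using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.Net;

namespace Demo
{
    [SqlUserDefinedProcessor]
    public class GeoIpProcessor : IProcessor
    {
        // Opened once per vertex; the deployed GeoIP2-City.mmdb resource is
        // available by file name in the vertex working directory.
        private readonly Reader reader = new Reader("GeoIP2-City.mmdb");

        public override IRow Process(IRow input, IUpdatableRow output)
        {
            var ip = input.Get<string>("IP");
            output.Set<string>("IP", ip);

            string country = null;
            IPAddress address;
            if (IPAddress.TryParse(ip, out address))
            {
                // Same call the question uses; returns the record as nested dictionaries.
                var data = reader.Find<Dictionary<string, object>>(address);
                var countryData = data != null && data.ContainsKey("country")
                    ? data["country"] as Dictionary<string, object>
                    : null;
                if (countryData != null && countryData.ContainsKey("iso_code"))
                {
                    // "country"/"iso_code" is one example path; the exact shape of the
                    // returned dictionaries depends on the MaxMind database edition.
                    country = countryData["iso_code"] as string;
                }
            }

            output.Set<string>("Country", country);
            return output.AsReadOnly();
        }
    }
}

The only changes to the script above would then be the DEPLOY RESOURCE path and USING new Demo.GeoIpProcessor(); with a matching PRODUCE clause (e.g. IP string, Country string).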