2
votes

I am new to using Azure Data Lake Store and Azure Data Lake Analytics.

Question

Is there a way to get the hash of a file (or files) stored in Azure Data Lake Store, so that I can analyze whether the data has changed?

I have a bunch of input files stored with a similar structure

  • /Input/
    • Client-01/
      • Product-A/
        • Input01.csv
    • Client-02/
      • Product-A/
        • Input01.csv
        • Input02.csv

What I have tried

Part 01

I was able to use Get-FileHash locally, but could NOT find anything equivalent for ADLS, or anything remotely similar to this:

Get-FileHash "Input/Client-01/*.csv" -Algorithm MD5 | ConvertTo-Json >> statistics.json

to generate hashes like

[
    {
        "Algorithm":  "MD5",
        "Hash":  "BA961B4B72DC602C2D2CA2B13EFC09DB",
        "Path":  "Input/Client-01/Input01.csv"
    },
    {
        "Algorithm":  "MD5",
        "Hash":  "B0528707D4E689EEEFE1AA1811063014",
        "Path":  "Input/Client-02/Input01.csv"
    },
    {
        "Algorithm":  "MD5",
        "Hash":  "60D71494355E7EE941782F1BE2969F3C",
        "Path":  "Input/Client-02/Input02.csv"
    }
]

Part 02

I was able to get some more details using

Get-AzureRmDataLakeStoreChildItem -Account $datalakeStoreName -Path $path | ConvertTo-Json

which results in

{
    "LastWriteTime":  "\/Date(1534185132238)\/",
    "LastAccessTime":  "\/Date(1534185132180)\/",
    "Expiration":  null,
    "Name":  "Input01.csv",
    "Path":  "/Input/Client-01/",
    "AccessTime":  1534185132180,
    "BlockSize":  268435456,
    "ChildrenNum":  null,
    "ExpirationTime":  null,
    "Group":  "e045d366-777b-4e7a-a01d-79dbf0e28a61",
    "Length":  127,
    "ModificationTime":  1534185132238,
    "Owner":  "3bb6c9c4-da61-4cc2-b6ef-f4739adafff5",
    "PathSuffix":  "Input01.csv",
    "Permission":  "770",
    "Type":  0,
    "AclBit":  true
}

Drawbacks:

  • there is no hash :( (see the sketch below for a rough workaround using Length and ModificationTime)
  • running this on a schedule would involve something like a Batch service on Data Factory (it's technically not a drawback, but it was for me, as I am not invested in batch services yet..)
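
Since the listing does expose Length and ModificationTime, one rough workaround (not a real content hash) is to snapshot those values on each run and diff them against the previous snapshot. A minimal sketch, assuming $datalakeStoreName is set, an AzureRM session is already logged in, and "previous-snapshot.json" is just a hypothetical local state file:

# Take a snapshot of Path, Length and ModificationTime for the files in one folder
$snapshotFile = "previous-snapshot.json"
$current = Get-AzureRmDataLakeStoreChildItem -Account $datalakeStoreName -Path "/Input/Client-01" |
    Select-Object @{ n = "Path"; e = { "$($_.Path)$($_.PathSuffix)" } }, Length, ModificationTime

if (Test-Path $snapshotFile) {
    $previous = Get-Content $snapshotFile -Raw | ConvertFrom-Json
    # Files whose Path/Length/ModificationTime tuple is new or different since the last run
    $changed = Compare-Object $previous $current -Property Path, Length, ModificationTime -PassThru |
        Where-Object { $_.SideIndicator -eq "=>" }
} else {
    $changed = $current   # first run: treat everything as new
}

$changed | Format-Table Path, Length, ModificationTime
$current | ConvertTo-Json | Set-Content $snapshotFile

This sketch only lists a single folder, and a rewrite that happens to keep the same length and timestamp would slip through, so it is a weaker signal than the MD5 approach from Part 01.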

Part 3 : using the ADLS NuGet package

The ADLS NuGet package supports a few endpoints. I was specifically looking at DirectoryEntry; however, the model does not expose the BlockSize that is available in the other endpoints :(

https://github.com/Azure-Samples/data-lake-store-adls-dot-net-get-started/blob/master/AdlsSDKGettingStarted/Program.cs

private static void PrintDirectoryEntry(DirectoryEntry entry)
{
    Console.WriteLine($"Name: {entry.Name}");
    Console.WriteLine($"FullName: {entry.FullName}");
    Console.WriteLine($"Length: {entry.Length}");
    Console.WriteLine($"Type: {entry.Type}");
    Console.WriteLine($"User: {entry.User}");
    Console.WriteLine($"Group: {entry.Group}");
    Console.WriteLine($"Permission: {entry.Permission}");
    Console.WriteLine($"Modified Time: {entry.LastModifiedTime}");
    Console.WriteLine($"Last Accessed Time: {entry.LastAccessTime}");
    Console.WriteLine();
}

Part 4 : using the webHDFS API (somewhat worked)

https://docs.microsoft.com/en-us/rest/api/datalakestore/webhdfs-filesystem-apis

I was able to use the op=LISTSTATUS operation (see the documentation link above) to get FileStatuses, which has both blockSize and length, so this is somewhat helpful (a sample call is sketched after the response below).

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 427

{
  "FileStatuses":
  {
    "FileStatus":
    [
      {
        "accessTime"      : 1320171722771,
        "blockSize"       : 33554432,
        "group"           : "supergroup",
        "length"          : 24930,
        "modificationTime": 1320171722771,
        "owner"           : "webuser",
        "pathSuffix"      : "a.patch",
        "permission"      : "644",
        "replication"     : 1,
        "type"            : "FILE"
      },
      {
        "accessTime"      : 0,
        "blockSize"       : 0,
        "group"           : "supergroup",
        "length"          : 0,
        "modificationTime": 1320895981256,
        "owner"           : "username",
        "pathSuffix"      : "bar",
        "permission"      : "711",
        "replication"     : 0,
        "type"            : "DIRECTORY"
      },
      ...
    ]
  }
}
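
For completeness, a minimal sketch of issuing that LISTSTATUS call from PowerShell; the account name and bearer token below are placeholders, and acquiring the Azure AD token is a separate step not shown here:

$accountName = "mydatalakestore"        # placeholder ADLS account name
$accessToken = "<AAD bearer token>"     # placeholder - obtain via Azure AD app / az cli

# LISTSTATUS returns the same FileStatuses payload shown above
$uri = "https://$accountName.azuredatalakestore.net/webhdfs/v1/Input/Client-01?op=LISTSTATUS"
$response = Invoke-RestMethod -Uri $uri -Headers @{ Authorization = "Bearer $accessToken" }

# blockSize and length are available per file, but still no content hash
$response.FileStatuses.FileStatus |
    Select-Object pathSuffix, length, blockSize, modificationTime |
    Format-Table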

1 Answer

0
votes

Are you looking to identify whether the file has changed, or to actually identify the rows within the file that have changed? If you want to identify row changes, then use an ADLA job to run a U-SQL script or function that creates a row hash.

If you want to identify whether the file has changed, I suspect you would need to run a job that loops through all of the files and generates a hash for each one. You could then store this value in another file or a table where you maintain a list of the files and their historical hash values. It's not going to be a single-step process; Azure Data Factory or a PowerShell Runbook would be the best way to orchestrate it.
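
If you go the PowerShell route, a rough sketch of that loop-and-record idea could look like the following. $datalakeStoreName, the folder list and the history file are placeholders, and each file is downloaded to a temp location just to hash it locally, since ADLS does not expose a content hash itself (this will be slow for large files):

$folders = "/Input/Client-01", "/Input/Client-02"
$history = @()

foreach ($folder in $folders) {
    # List the files in the folder (Type distinguishes files from directories)
    $files = Get-AzureRmDataLakeStoreChildItem -Account $datalakeStoreName -Path $folder |
        Where-Object { $_.Type -eq "FILE" }

    foreach ($file in $files) {
        $remotePath = "$($file.Path)$($file.PathSuffix)"
        $localPath  = Join-Path $env:TEMP $file.PathSuffix

        # Download, hash locally, then clean up the temp copy
        Export-AzureRmDataLakeStoreItem -Account $datalakeStoreName -Path $remotePath -Destination $localPath -Force
        $hash = Get-FileHash $localPath -Algorithm MD5
        Remove-Item $localPath

        $history += [pscustomobject]@{
            Path      = $remotePath
            Hash      = $hash.Hash
            Algorithm = $hash.Algorithm
            CheckedAt = (Get-Date).ToString("o")
        }
    }
}

# Append this run to a running log; compare against earlier entries to spot changed files
$history | ConvertTo-Json | Add-Content "hash-history.json"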