I am new to Azure Data Lake Store and Azure Data Lake Analytics.
Question
Is there a way to get the hash of a file (or files) stored in Azure Data Lake Store, so that I can tell whether the data has changed?
I have a bunch of input files stored in a similar structure:
/Input/
- Client-01/
- Product-A/
- Input01.csv
/Input/
- Client-02/
- Product-A/
- Input01.csv
- Input02.csv
What I have tried
Part 01
Locally, I was able to use Get-FileHash, but I could NOT find an equivalent (or anything remotely similar) for ADLS:
Get-FileHash "Input/Client-01/*.csv" -Algorithm MD5 | ConvertTo-Json >> statistics.json
to generate hashes like
[
    {
        "Algorithm": "MD5",
        "Hash": "BA961B4B72DC602C2D2CA2B13EFC09DB",
        "Path": "Input/Client-01/Input01.csv"
    },
    {
        "Algorithm": "MD5",
        "Hash": "B0528707D4E689EEEFE1AA1811063014",
        "Path": "Input/Client-02/Input01.csv"
    },
    {
        "Algorithm": "MD5",
        "Hash": "60D71494355E7EE941782F1BE2969F3C",
        "Path": "Input/Client-02/Input02.csv"
    }
]
Part 02
I was able to get some more details using
Get-AzureRmDataLakeStoreChildItem -Account $datalakeStoreName -Path $path | ConvertTo-Json
which results in
{
    "LastWriteTime": "\/Date(1534185132238)\/",
    "LastAccessTime": "\/Date(1534185132180)\/",
    "Expiration": null,
    "Name": "Input01.csv",
    "Path": "/Input/Client-01/",
    "AccessTime": 1534185132180,
    "BlockSize": 268435456,
    "ChildrenNum": null,
    "ExpirationTime": null,
    "Group": "e045d366-777b-4e7a-a01d-79dbf0e28a61",
    "Length": 127,
    "ModificationTime": 1534185132238,
    "Owner": "3bb6c9c4-da61-4cc2-b6ef-f4739adafff5",
    "PathSuffix": "Input01.csv",
    "Permission": "770",
    "Type": 0,
    "AclBit": true
}
Drawbacks:
- there is no hash :(
- running this on a schedule would involve something like a Batch service or Data Factory (technically not a drawback, but it was for me, as I am not invested in Batch services yet...)
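Since no server-side hash is exposed, one workaround is to derive a change fingerprint from the metadata that IS available (Length and ModificationTime above). This is only a sketch of the idea, not an official API, and it can only detect changes that alter the size or modification time:

```python
import hashlib

def metadata_fingerprint(path: str, length: int, modification_time: int) -> str:
    """Derive a pseudo-hash from file metadata; any change to size or
    modification time changes the fingerprint."""
    raw = f"{path}|{length}|{modification_time}".encode("utf-8")
    return hashlib.md5(raw).hexdigest().upper()

# Values taken from the ChildItem output above
fp = metadata_fingerprint("/Input/Client-01/Input01.csv", 127, 1534185132238)
```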
Part 03: using the ADLS NuGet package
The ADLS NuGet package supports a few endpoints. I was specifically looking at DirectoryEntry; however, the model does not expose the BlockSize that is available in the other endpoints :(
private static void PrintDirectoryEntry(DirectoryEntry entry)
{
    Console.WriteLine($"Name: {entry.Name}");
    Console.WriteLine($"FullName: {entry.FullName}");
    Console.WriteLine($"Length: {entry.Length}");
    Console.WriteLine($"Type: {entry.Type}");
    Console.WriteLine($"User: {entry.User}");
    Console.WriteLine($"Group: {entry.Group}");
    Console.WriteLine($"Permission: {entry.Permission}");
    Console.WriteLine($"Modified Time: {entry.LastModifiedTime}");
    Console.WriteLine($"Last Accessed Time: {entry.LastAccessTime}");
    Console.WriteLine();
}
Part 04: using the WebHDFS API (somewhat worked)
https://docs.microsoft.com/en-us/rest/api/datalakestore/webhdfs-filesystem-apis
I was able to use op=LISTSTATUS (documentation link) to get FileStatuses, which exposes both blockSize and length, so this is somewhat helpful:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 427
{
    "FileStatuses":
    {
        "FileStatus":
        [
            {
                "accessTime"      : 1320171722771,
                "blockSize"       : 33554432,
                "group"           : "supergroup",
                "length"          : 24930,
                "modificationTime": 1320171722771,
                "owner"           : "webuser",
                "pathSuffix"      : "a.patch",
                "permission"      : "644",
                "replication"     : 1,
                "type"            : "FILE"
            },
            {
                "accessTime"      : 0,
                "blockSize"       : 0,
                "group"           : "supergroup",
                "length"          : 0,
                "modificationTime": 1320895981256,
                "owner"           : "username",
                "pathSuffix"      : "bar",
                "permission"      : "711",
                "replication"     : 0,
                "type"            : "DIRECTORY"
            },
            ...
        ]
    }
}
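A LISTSTATUS response like the one above can then be turned into a per-file snapshot for change detection. A minimal sketch (field names come from the response shown; the function name is mine, and actually fetching the response from the endpoint is out of scope here):

```python
import json

def snapshot(liststatus_json: str) -> dict:
    """Map each file's pathSuffix to (length, modificationTime) from a
    WebHDFS LISTSTATUS response body; directory entries are skipped."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return {
        s["pathSuffix"]: (s["length"], s["modificationTime"])
        for s in statuses
        if s["type"] == "FILE"
    }
```

Comparing two snapshots taken at different times tells you which files were added, removed, or changed in size or modification time.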