We have many different pipelines in Azure Data Factory, with many datasets, mainly Azure Data Lake Store and Azure Blob datasets. I want to know the file size of all files (from all datasets of all pipelines). I am able to iterate over all the datasets of all the pipelines using DataFactoryManagementClient in C#, but when I try to read the fileName or folderName of a dataset, I get null. You can see my code below:
private static void GetDataSetSize(DataFactoryManagementClient dataFactoryManagementClient)
{
    string resourceGroupName = "resourceGroupName";
    foreach (var dataFactory in dataFactoryManagementClient.DataFactories.List(resourceGroupName).DataFactories)
    {
        var linkedServices = new List<LinkedService>(dataFactoryManagementClient.LinkedServices.List(resourceGroupName, dataFactory.Name).LinkedServices);
        var datasets = dataFactoryManagementClient.Datasets.List(resourceGroupName, dataFactory.Name).Datasets;
        foreach (var dataset in datasets)
        {
            // Resolve the linked service so I can tell ADLS datasets apart from Blob datasets.
            var lsTypeProperties = linkedServices.First(ls => ls.Name == dataset.Properties.LinkedServiceName).Properties.TypeProperties;
            if (lsTypeProperties is AzureDataLakeStoreLinkedService)
            {
                var outputLinkedService = (AzureDataLakeStoreLinkedService)lsTypeProperties;
                var folder = GetBlobFolderPathDL(dataset); // returns null
                var file = GetBlobFileNameDL(dataset);     // returns null
            }
        }
    }
}
public static string GetBlobFolderPathDL(Dataset dataset)
{
    if (dataset == null || dataset.Properties == null)
    {
        return string.Empty;
    }
    // The cast to AzureDataLakeStoreDataset succeeds, but FolderPath comes back null.
    AzureDataLakeStoreDataset dlDataset = dataset.Properties.TypeProperties as AzureDataLakeStoreDataset;
    if (dlDataset == null)
    {
        return string.Empty;
    }
    return dlDataset.FolderPath;
}

public static string GetBlobFileNameDL(Dataset dataset)
{
    if (dataset == null || dataset.Properties == null)
    {
        return string.Empty;
    }
    // Same here: the cast succeeds, but FileName comes back null.
    AzureDataLakeStoreDataset dlDataset = dataset.Properties.TypeProperties as AzureDataLakeStoreDataset;
    if (dlDataset == null)
    {
        return string.Empty;
    }
    return dlDataset.FileName;
}
With this, I want to build a monitoring tool that tells me how the data is growing for each file/dataset.
FYI - I am also going to monitor the retries and failures of each slice. I can get that information without any issue; the problem right now is getting the file name and folder path, because the API returns null for them (it seems to be a bug in the API). Once I have the folder and file path, I will use DataLakeStoreFileSystemManagementClient to get the size of those files. I plan to ingest all of this data (size, file name, retries, failures, etc.) into a SQL database and generate reports on top of it that show how my data is growing daily, hourly, and so on.
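For reference, this is roughly how I expect the size lookup to work once I have the paths. This is only a minimal sketch, assuming the Microsoft.Azure.Management.DataLake.Store SDK; adlsAccountName, filePath/folderPath, and creds (a ServiceClientCredentials obtained from AAD) are placeholders I would fill in from the linked service and dataset:

// Minimal sketch of the planned size lookup (requires Microsoft.Azure.Management.DataLake.Store
// and, for the folder variant, System.Linq). All names below are placeholders.
public static long GetAdlsFileSize(ServiceClientCredentials creds, string adlsAccountName, string filePath)
{
    var fsClient = new DataLakeStoreFileSystemManagementClient(creds);
    // GetFileStatus returns metadata for a single file, including its length in bytes.
    var status = fsClient.FileSystem.GetFileStatus(adlsAccountName, filePath);
    return status.FileStatus.Length ?? 0;
}

// If a dataset only specifies a folder, sum the sizes of the files directly under it instead.
public static long GetAdlsFolderSize(ServiceClientCredentials creds, string adlsAccountName, string folderPath)
{
    var fsClient = new DataLakeStoreFileSystemManagementClient(creds);
    var listing = fsClient.FileSystem.ListFileStatus(adlsAccountName, folderPath);
    return listing.FileStatuses.FileStatus
        .Where(f => f.Type == FileType.FILE)
        .Sum(f => f.Length ?? 0);
}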
I want to make it generic, so that if I add a new dataset or pipeline in the future, I get the size of all newly added datasets as well, without changing any code.
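To illustrate what I mean by generic: one helper that dispatches on the dataset's type properties instead of hard-coding AzureDataLakeStoreDataset, so every dataset of a type it already knows about is picked up automatically. This is just a sketch; I am assuming AzureBlobDataset is the Blob counterpart in the same SDK, and the tuple return is only for brevity:

// Sketch of a generic path extraction: dispatch on the dataset's type properties
// rather than hard-coding a single dataset type.
public static (string folder, string file) GetDatasetPath(Dataset dataset)
{
    switch (dataset?.Properties?.TypeProperties)
    {
        case AzureDataLakeStoreDataset adls:
            return (adls.FolderPath, adls.FileName);
        case AzureBlobDataset blob:
            return (blob.FolderPath, blob.FileName);
        default:
            // Any dataset type not handled yet lands here; log it so I know to add a case.
            return (string.Empty, string.Empty);
    }
}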
Please help me understand how I can achieve this, and suggest an alternative approach if there is one.