We have many pipelines in Azure Data Factory with many datasets, mainly Azure Data Lake Store and Azure Blob datasets. I want to know the size of every file (from all datasets of all pipelines). I can iterate over all the datasets of all the pipelines using DataFactoryManagementClient in C#, but when I try to read the file name or folder path of a dataset, I get null. Here is my code:

    private static void GetDataSetSize(DataFactoryManagementClient dataFactoryManagementClient)
    {
        string resourceGroupName = "resourceGroupName";

        foreach (var dataFactory in dataFactoryManagementClient.DataFactories.List(resourceGroupName).DataFactories)
        {
            var linkedServices = new List<LinkedService>(dataFactoryManagementClient.LinkedServices.List(resourceGroupName, dataFactory.Name).LinkedServices);
            var datasets = dataFactoryManagementClient.Datasets.List(resourceGroupName, dataFactory.Name).Datasets;

            foreach (var dataset in datasets)
            {
                // Look up the linked service that this dataset points to.
                var lsTypeProperties = linkedServices.First(ls => ls.Name == dataset.Properties.LinkedServiceName).Properties.TypeProperties;

                if (lsTypeProperties.GetType() == typeof(AzureDataLakeStoreLinkedService))
                {
                    AzureDataLakeStoreLinkedService outputLinkedService = lsTypeProperties as AzureDataLakeStoreLinkedService;

                    // Problem: both of these come back null.
                    var folder = GetBlobFolderPathDL(dataset);
                    var file = GetBlobFileNameDL(dataset);
                }
            }
        }
    }

    public static string GetBlobFolderPathDL(Dataset dataset)
    {
        if (dataset == null || dataset.Properties == null)
        {
            return string.Empty;
        }

        AzureDataLakeStoreDataset dlDataset = dataset.Properties.TypeProperties as AzureDataLakeStoreDataset;
        if (dlDataset == null)
        {
            return string.Empty;
        }

        return dlDataset.FolderPath;
    }

    public static string GetBlobFileNameDL(Dataset dataset)
    {
        if (dataset == null || dataset.Properties == null)
        {
            return string.Empty;
        }

        AzureDataLakeStoreDataset dlDataset = dataset.Properties.TypeProperties as AzureDataLakeStoreDataset;
        if (dlDataset == null)
        {
            return string.Empty;
        }

        return dlDataset.FileName;
    }

With this, I want to build a monitoring tool that tells me how the data is growing for each file/dataset.

FYI - I am also going to monitor the retries and failures of each slice. I can get that information without any issue; the problem is getting the file name and folder path, because both come back null (it seems to be a bug in the API). Once I have the folder and file path, I will use DataLakeStoreFileSystemManagementClient to get the size of those files. I plan to ingest all this data (size, file name, retries, failures, etc.) into a SQL database and build reports on top of it that show how my data is growing daily, hourly, and so on.
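
To make the reporting goal concrete, here is a minimal sketch of the growth calculation itself, independent of any Azure SDK. The `SizeSample` and `GrowthReport` names are hypothetical and only illustrate the idea; the sketch assumes each monitoring run has already recorded one size measurement per file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one size measurement; not part of any Azure SDK.
public sealed class SizeSample
{
    public string Path { get; }
    public DateTime TakenAtUtc { get; }
    public long SizeBytes { get; }

    public SizeSample(string path, DateTime takenAtUtc, long sizeBytes)
    {
        Path = path;
        TakenAtUtc = takenAtUtc;
        SizeBytes = sizeBytes;
    }
}

public static class GrowthReport
{
    // For each path, returns the change in bytes between the earliest
    // and the latest sample. A path with a single sample reports 0.
    public static Dictionary<string, long> GrowthByPath(IEnumerable<SizeSample> samples)
    {
        return samples
            .GroupBy(s => s.Path)
            .ToDictionary(
                g => g.Key,
                g =>
                {
                    var ordered = g.OrderBy(s => s.TakenAtUtc).ToList();
                    return ordered[ordered.Count - 1].SizeBytes - ordered[0].SizeBytes;
                });
    }
}
```

For example, two samples of 1000 and then 1500 bytes for the same path give a growth of 500 for that path. In the planned tool the same calculation would more likely be a SQL query over the ingested rows; the sketch only shows its shape.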

I want to make it generic, so that if I add a new dataset or pipeline in the future, I also get the size of the newly added datasets without changing any code.

How can I achieve this? Please also suggest an alternative approach if there is one.

1 Answer

Place this code in your main method and run it. It fetches each dataset individually with Datasets.Get, and you should then be able to see the folder path and file name of your datasets. Adapt it to your requirements.

Hope this helps!

    foreach (var dataFactory in dataFactoryManagementClient.DataFactories.List(resourceGroupName).DataFactories)
    {
        var datasets = dataFactoryManagementClient.Datasets.List(resourceGroupName, dataFactory.Name).Datasets;

        foreach (var dataset in datasets)
        {
            // Fetch the dataset by name instead of using the entry returned by List.
            var lsTypeProperties = dataFactoryManagementClient.Datasets.Get(resourceGroupName, dataFactory.Name, dataset.Name);

            if (lsTypeProperties.Dataset.Properties.TypeProperties.GetType() == typeof(AzureDataLakeStoreDataset))
            {
                AzureDataLakeStoreDataset OutputDataSet = lsTypeProperties.Dataset.Properties.TypeProperties as AzureDataLakeStoreDataset;
                Console.WriteLine(OutputDataSet.FolderPath);
                Console.WriteLine(OutputDataSet.FileName);
                Console.ReadKey(); // Pauses after each dataset until a key is pressed.
            }
        }
    }
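
The folder path and file name printed above still have to be joined into one path before a file size can be looked up. Here is a minimal sketch of that step. `DatasetPaths.BuildPath` is a hypothetical helper, not an SDK call, and it assumes that a missing file name means the dataset refers to the folder as a whole and that store paths are rooted at `/`.

```csharp
using System;

public static class DatasetPaths
{
    // Joins a folder path and a file name with exactly one '/' between them.
    // Assumption: a null or empty file name means "the folder itself".
    public static string BuildPath(string folderPath, string fileName)
    {
        string folder = (folderPath ?? string.Empty).Trim('/');
        string file = (fileName ?? string.Empty).Trim('/');

        if (file.Length == 0)
        {
            return "/" + folder;
        }

        return folder.Length == 0 ? "/" + file : "/" + folder + "/" + file;
    }
}
```

With this helper, `DatasetPaths.BuildPath("raw/orders/", "data.csv")` returns `/raw/orders/data.csv`, and `DatasetPaths.BuildPath("raw/orders", null)` returns `/raw/orders`. The example paths are made up for illustration.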