1 vote

Why does an Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 return many rows for a single file source, but zero rows for the parent folder source?

I created:

  • Master Key (CREATE MASTER KEY;)
  • Credential (CREATE DATABASE SCOPED CREDENTIAL) - uses the ADLS Gen 2 account key
  • External data source (CREATE EXTERNAL DATA SOURCE)
  • File format (CREATE EXTERNAL FILE FORMAT)
  • External table (CREATE EXTERNAL TABLE)
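
The statements I ran follow this shape (a sketch only: the credential name, the identity string, and the key value are placeholders here; `ADLSDataSource` and `uncompressedcsv` are the names used in the external table below):

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH IDENTITY = 'user',
     SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE ADLSDataSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://<filesystem>@<account>.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);

CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);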

Everything works fine when my external table points to a specific file, i.e.

CREATE EXTERNAL TABLE [ext].[Time]
(
    [TimeID] int NOT NULL,
    [HourNumber] tinyint NOT NULL,
    [MinuteNumber] tinyint NOT NULL,
    [SecondNumber] tinyint NOT NULL,
    [TimeInSecond] int NOT NULL,
    [HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
    LOCATION = '/Time/time001.txt',
    DATA_SOURCE = ADLSDataSource,
    FILE_FORMAT = uncompressedcsv,
    REJECT_TYPE = value,
    REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];

Many rows are returned, so I am confident that all of the items above are configured correctly.

The Time folder in Azure Data Lake Gen 2 contains many files, not just time001.txt. When I change my external table to point at a folder, and not an individual file, the query returns zero rows, i.e.

CREATE EXTERNAL TABLE [ext].[Time]
(
    [TimeID] int NOT NULL,
    [HourNumber] tinyint NOT NULL,
    [MinuteNumber] tinyint NOT NULL,
    [SecondNumber] tinyint NOT NULL,
    [TimeInSecond] int NOT NULL,
    [HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
    LOCATION = '/Time/',
    DATA_SOURCE = ADLSDataSource,
    FILE_FORMAT = uncompressedcsv,
    REJECT_TYPE = value,
    REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];

Zero rows returned

I tried:

  • LOCATION = '/Time/',
  • LOCATION = '/Time',
  • LOCATION = 'Time/',
  • LOCATION = 'Time',

But always zero rows. I also followed the instructions at https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store

I tested all the files within the folder and individually each returns many rows of data.

I queried all the files from Blob storage and not ADLS Gen2 and the "Folder" query returns all rows as expected.

How do I query all files in a folder "as one" from Azure Data Lake Gen2 storage using Azure SQL Data Warehouse and Polybase?

  • What do you put in the credential? Storage account key? SAS token? Service principal? – GregGalloway
  • What are the actual folder and file names? I assume none of the files or parent folders start with “_” or “.”? – GregGalloway
  • The credential uses the account key. Pointing at a specific file returns data, so I believe the credentials are good. The folder is "Time" and the file is "time001.txt"; no folders or files start with "_" or ".". – Andy Jones
  • I know it doesn't match the documentation, but have you tried adding a wildcard to the LOCATION: '/Time/*.txt'? – Joel Cochran
  • Thanks, @JoelCochran, but using wildcards didn't fix it. – Andy Jones

1 Answer

0 votes

I was facing exactly the same issue: the problem was the protocol in the data source's LOCATION.

Script with error:

   CREATE EXTERNAL DATA SOURCE datasourcename
   WITH (
       TYPE = HADOOP,
       LOCATION = 'abfss://[email protected]',
       CREDENTIAL = credential_name
   );

Script that solves issue:

   CREATE EXTERNAL DATA SOURCE datasourcename
   WITH (
       TYPE = HADOOP,
       LOCATION = 'abfss://[email protected]',
       CREDENTIAL = credential_name
   );

The only change needed was the LOCATION.
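
For reference, the documented LOCATION format for an ADLS Gen2 external data source of TYPE = HADOOP uses the abfss scheme against the dfs endpoint (the filesystem and account names below are placeholders; a common failure mode is pointing at the blob endpoint, or using abfs:// instead of abfss://):

   CREATE EXTERNAL DATA SOURCE datasourcename
   WITH (
       TYPE = HADOOP,
       LOCATION = 'abfss://<filesystem>@<account>.dfs.core.windows.net',
       CREDENTIAL = credential_name
   );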

Thanks to the Microsoft Support Team for helping me on this.