
I am working with parquet files stored on Amazon S3. These files need to be extracted and their data loaded into Azure SQL Data Warehouse.

My plan is:

Amazon S3 -> Use SAP BODS to move parquet files to Azure Blob -> Create external tables on those parquet files -> Staging -> Fact/Dim tables

Now the problem is that one of the parquet files has a column stored as array&lt;string&gt;. I am able to create an external table over it by using the varchar data type for that column, but any SQL query (e.g. a SELECT) against that external table throws the error below.

Msg 106000, Level 16, State 1, Line 3
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: optional group status (LIST) {
  repeated group bag {
    optional binary array_element (UTF8);
  }
} is not primitive

I have tried different data types but am still unable to run a SELECT query on that external table.
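For reference, this is roughly the external table definition I used (object names are simplified, and the external data source is assumed to already exist):

```sql
-- Simplified sketch; data source and location are placeholders.
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE ext.StagingEvents
(
    id      INT,
    -- "status" is the column stored as array<string> in the parquet file;
    -- mapping it to varchar is what triggers the ClassCastException on SELECT
    status  VARCHAR(4000)
)
WITH (
    LOCATION    = '/events/',                -- placeholder blob path
    DATA_SOURCE = AzureBlobStorageSource,    -- assumed pre-existing external data source
    FILE_FORMAT = ParquetFileFormat
);
```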

Please let me know if there are any other options.

Thanks


1 Answer


On Azure, there is a service named Azure Data Factory which I think can be used in your scenario, as the document Parquet format in Azure Data Factory states:

Parquet format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP.

You can follow the tutorial Load data into Azure SQL Data Warehouse by using Azure Data Factory to set up Amazon S3 with the parquet format as the source and copy the data directly to Azure SQL Data Warehouse. Because Azure Data Factory reads the parquet files with automatic schema parsing, it should make your task straightforward.
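As a rough sketch, the copy activity in such a pipeline would look something like the JSON below; the dataset names are placeholders you would define yourself, with an Amazon S3 parquet dataset as the source and a data warehouse table as the sink.

```json
{
    "name": "CopyS3ParquetToSqlDw",
    "type": "Copy",
    "inputs":  [ { "referenceName": "AmazonS3ParquetDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SqlDwStagingTable", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "ParquetSource" },
        "sink": {
            "type": "SqlDWSink",
            "allowPolyBase": true
        }
    }
}
```

With allowPolyBase set to true on the sink, the copy is performed as a bulk PolyBase load rather than row-by-row inserts, which is the recommended approach for Azure SQL Data Warehouse.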

Hope it helps.