0
votes

I use ADF to ingest the data from SQL server to ADLS GEN2 in a Parquet Snappy format, But the size of the file in sink goes upto 120 GB, The size causes me a lot of problem when I read this file in Spark and join the data from this file with many other Parquet files.

I am thinking to use Delta lake's unmanage table with the location pointing to the ADLS location, I am able to create an UnManaged table if I don't specify any partition using this

" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S)"

But if I would want to partition this file for query optimization

" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S), PARTITIONED_COLUMN DATATYPE"

It gives me error like the one mentioned in the screenshot (find the attachment).

Error in Text :- org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://[email protected]/level1/Level2/Table1.parquet.snappy;

There is no way that I can create this Parquet file using ADF with partition details (Am open for suggestions)

Am I giving a wrong Syntax or this can be even done?

1

1 Answers

0
votes

Ok, I found the answer to this. While you convert parquet files to delta using the above approach, Delta would look for the correct directory structure with partition information along with the name of the column mentioned in "Partitioned By" clause.

For E.g, I have a folder called /Parent, inside this I have a directory structure with partition information, the partitioned parquet files are kept one level further inside the partitioned folders, the folder names are like this

/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet /Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet /Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet /Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet

in this case, subfolder is the partitions created inside parent.

CONVERT TO DELTA parquet./Parent/ partitioned by (Subfolder INT)

will just take this directory structure and convert the whole partitioned data to delta and will store the partitioned information in metastore.

Summary:- This command is only to utilize already created partitioned Parquet files. To create partition on single Parquet file you would have to take different route, Which I can explain you later if you are interested ;)