How to Include the Value of Partitioned Column in a Spark data frame or Spark SQL Temp Table in AWS Glue?

Question

I am using Python 3 and Glue 1.0 for this code.

I have partitioned data in S3. The data is partitioned by the year, month, day and extra_field_name columns.

When I load the data into a data frame, its schema contains all the columns except the partition columns.

Here is the code and output

glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True, 'groupFiles': 'inPartition'}, format = "parquet").toDF().registerTempTable(final_arguement_list["read_table_" + str(i+1)])

The path_list variable contains the list of path strings to load into the data frame. I print the schema with the following command:

glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True}, format = "parquet").toDF().printSchema()

The schema that appears in the CloudWatch logs does not contain any of the partition columns. Note that I have also tried loading the data with paths that stop at each partition level in turn (year, month, day, extra_field_name), but I still get only the columns that are present in the parquet files themselves.

Here is the path list: ['s3://path/to/source/data/year=2018/month=1/day=4/', 's3://path/to/source/data/year=2018/month=1/day=5/', 's3://path/to/source/data/year=2018/month=1/day=6/'] (comment by Rishabh Dixit)

Rishabh Dixit · Accepted Answer · 2019-12-05T11:34:45

As a workaround, I created duplicate columns in the data frame itself, named year_2, month_2, day_2 and extra_field_name_2, as copies of year, month, day and extra_field_name.

During the data ingestion phase, I partition the data frame on year, month, day and extra_field_name and store it in S3. The original partition columns end up only in the directory names, while the values of year_2, month_2, day_2 and extra_field_name_2 are retained in the parquet files themselves.
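The ingestion step described above could look roughly like the following. This is only a sketch under assumptions: `df` is taken to be a regular PySpark DataFrame, and the function names, the `suffix` parameter and `target_path` are hypothetical names introduced here for illustration. Only the standard DataFrame methods `withColumn`, `write.partitionBy` and `parquet` are used.

```python
# Partition columns named in the question.
PARTITION_COLUMNS = ["year", "month", "day", "extra_field_name"]


def add_partition_copies(df, partition_columns=PARTITION_COLUMNS, suffix="_2"):
    """Return df with a suffixed copy of each partition column.

    df is assumed to be a pyspark.sql.DataFrame; only withColumn and
    column indexing are used, so no pyspark import is needed here.
    """
    for name in partition_columns:
        # df[name] is a Column reference; withColumn adds it under the new name.
        df = df.withColumn(name + suffix, df[name])
    return df


def write_partitioned(df, target_path, partition_columns=PARTITION_COLUMNS):
    """Write df as parquet, partitioned by the original columns.

    target_path is a hypothetical S3 location chosen by the caller.
    """
    # partitionBy moves the original columns into the directory names
    # (year=.../month=.../...); the "_2" copies stay inside the parquet files.
    add_partition_copies(df, partition_columns).write.partitionBy(
        *partition_columns
    ).parquet(target_path)
```

Calling `write_partitioned(df, target_path)` on the ingestion data frame should then produce files that still carry the `_2` columns, at the cost of storing each partition value twice.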

While performing data manipulation, I am loading the data in a dynamic frame by providing the list of paths in the following manner:
['s3://path/to/source/data/year=2018/month=1/day=4/', 's3://path/to/source/data/year=2018/month=1/day=5/', 's3://path/to/source/data/year=2018/month=1/day=6/']

This gives me year_2, month_2, day_2 and extra_field_name_2 in the dynamic frame, which I can then use for further data manipulation.
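As a side note, the partition values are also encoded in the paths themselves, because they follow the key=value directory layout. The sketch below is not part of the workaround above. It is a plain-Python illustration, with a hypothetical helper name `partition_values`, of how those values could be recovered from a path string.

```python
def partition_values(path):
    """Extract key=value partition segments from an S3-style path.

    Values are returned as strings, exactly as they appear in the path.
    """
    values = {}
    for segment in path.rstrip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Keep only segments of the form key=value; skip "s3:", "path", etc.
        if sep and key:
            values[key] = value
    return values


paths = [
    "s3://path/to/source/data/year=2018/month=1/day=4/",
    "s3://path/to/source/data/year=2018/month=1/day=5/",
    "s3://path/to/source/data/year=2018/month=1/day=6/",
]

for p in paths:
    print(p, partition_values(p))
```

Whether this is preferable to storing duplicate columns depends on the job. It avoids the extra storage, but the values then have to be attached to the rows separately for each path.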

