I am using Python 3 and Glue 1.0 for this code.
I have partitioned data in S3. The data is partitioned by the year, month, day, and extra_field_name columns.
When I load the data into a data frame, its schema contains every column except the partition columns.
Here is the code and output:
glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True, 'groupFiles': 'inPartition'}, format = "parquet").toDF().registerTempTable(final_arguement_list["read_table_" + str(i+1)])
The path_list variable holds the list of path strings that need to be loaded into the data frame. I print the schema using the command below:
glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True}, format = "parquet").toDF().printSchema()
The schema that I get in the CloudWatch logs does not contain any of the partition columns. Please note that I have already tried passing the path only up to the year, month, day, and extra_field_name levels separately, but I still get only the columns that are present in the Parquet files themselves.
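For illustration, the missing columns exist only in the S3 key names, not inside the Parquet files. The following is a minimal sketch of what those columns would look like, assuming the folders follow the Hive-style key=value naming convention (the post does not confirm this). The helper partition_values, the bucket, the table name, and the values are all hypothetical placeholders.

```python
from urllib.parse import urlparse


def partition_values(s3_path):
    """Return {column: value} for every key=value segment in an S3 path.

    Assumes Hive-style partition folders such as year=2020/month=01.
    """
    # Drop the scheme and bucket, then split the key into its segments.
    segments = urlparse(s3_path).path.strip("/").split("/")
    # Keep only segments shaped like key=value; file names are skipped.
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


# Hypothetical path: bucket, table name, and values are placeholders.
example = (
    "s3://my-bucket/my-table/"
    "year=2020/month=01/day=15/extra_field_name=abc/part-0000.parquet"
)
print(partition_values(example))
```

These four keys are the columns expected in the schema in addition to the ones stored in the files.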
path_list? - Oliver W.