I have data in an S3 bucket containing many JSON files, laid out somewhat like this:
s3://bucket1/news/year=2018/month=01/day=01/hour=xx/
The day partition contains multiple hour=xx partitions, one for each hour of the day. I run a Glue ETL job on the files in the day partition and create a DynamicFrame using create_dynamic_frame.from_options. I then apply some mapping with ApplyMapping.apply, which works like a charm.
However, I would then like to create a new column containing the hour value, based on the partition each file came from. I can use Spark to create a new column with a constant value, but how do I make this column use the partition as its source?
df1 = dynamicFrame.toDF().withColumn("update_date", lit("new column value"))
Edit 1:
The article from AWS on how to use partitioned data uses a Glue crawler before the creation of the dynamicFrame, and then creates the dynamicFrame from the Glue catalog. I need to create the dynamicFrame directly from the S3 source.
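One possible direction (a sketch, not a verified answer for this exact setup): since each row's source file path contains the hour=xx segment, the hour could be parsed out of the path itself with a regex, e.g. via Spark's input_file_name() function, instead of relying on a catalog partition column. The dynamicFrame name and the exact path below are assumptions taken from the question:

```python
import re

# Hypothetical sample path matching the bucket layout from the question.
path = "s3://bucket1/news/year=2018/month=01/day=01/hour=07/part-0000.json"

# The same pattern that Spark's regexp_extract would apply:
hour = re.search(r"hour=(\d+)", path).group(1)
print(hour)  # → 07

# In the Glue job itself, the idea would look roughly like (untested sketch):
# from pyspark.sql.functions import input_file_name, regexp_extract
# df1 = dynamicFrame.toDF().withColumn(
#     "hour", regexp_extract(input_file_name(), r"hour=(\d+)", 1)
# )
```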