3 votes

I have data in an S3 bucket containing many JSON files, organized somewhat like this:

s3://bucket1/news/year=2018/month=01/day=01/hour=xx/

The day partition contains multiple hour=xx partitions, one for each hour of the day. I run a Glue ETL job on the files in the day partition and create a DynamicFrame with create_dynamic_frame_from_options. I then apply some mapping using ApplyMapping.apply, which works like a charm.
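For context, roughly what that flow looks like (a sketch only; the bucket path, mapping tuple, and variable names here are placeholders, not my actual job):

from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read all JSON files under one day partition straight from S3
dynamicFrame = glueContext.create_dynamic_frame_from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://bucket1/news/year=2018/month=01/day=01/'], 'recurse': True},
    format='json')

# Rename/retype fields; the mapping tuple below is just an example
mapped = ApplyMapping.apply(
    frame=dynamicFrame,
    mappings=[('title', 'string', 'title', 'string')])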

However, I would then like to create a new column containing the hour value, based on the partition each file came from. I can use Spark to create a new column with a constant, as below, but how do I make this column use the partition value as its source?

from pyspark.sql.functions import lit
df1 = dynamicFrame.toDF().withColumn("update_date", lit("new column value"))

Edit 1

The AWS article on working with partitioned data uses a Glue crawler before creating the DynamicFrame and then creates the DynamicFrame from the Glue Data Catalog. I need to create the DynamicFrame directly from the S3 source.

4 votes

Thank you, I have seen the article; however, they use a Glue crawler before creating the DynamicFrame and then create the DynamicFrame from a Glue catalog. I need to create the DynamicFrame directly from the S3 source. – Cactus

4 Answers

3 votes

I am not really following what you need to do. Don't you already have an hour value if your files are partitioned on it, or do you only get it when you use create_dynamic_frame.from_catalog? Can you do df1["hour"] on the DataFrame, or select_fields(["hour"]) on the DynamicFrame?
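For reference, those two lookups would be written roughly like this (assuming the frame actually exposes an hour field, which is the open question here):

# On the DynamicFrame: keep just the named fields (returns a new DynamicFrame)
hours_dyf = dynamicFrame.select_fields(['hour'])

# On the Spark DataFrame after toDF(): reference the column directly
df1 = dynamicFrame.toDF()
df1.select(df1['hour']).show()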

You do not need to import any libraries if your data is partitioned on a timestamp in yyyymmddhh format; you can do this with plain column operations in Spark.

Example code. First I create some values to populate a DataFrame, then derive the new column as below.

# Sample data: a yyyymmddhh string plus some other value per row
df_values = [('2019010120',1),('2019010121',2),('2019010122',3),('2019010123',4)]
df = spark.createDataFrame(df_values,['yyyymmddhh','some_other_values'])
# Slicing a Column maps to substr(): [9:10] starts at 1-based position 9,
# which yields the trailing hour digits ("20", "21", ...)
df_new = df.withColumn("hour", df["yyyymmddhh"][9:10])
df_new.show()
+----------+-----------------+----+
|yyyymmddhh|some_other_values|hour|
+----------+-----------------+----+
|2019010120|                1|  20|
|2019010121|                2|  21|
|2019010122|                3|  22|
|2019010123|                4|  23|
+----------+-----------------+----+
0 votes

I'm not familiar with AWS Glue, but if the linked approach doesn't work for your case, you can try the following workaround:

Get the file name using input_file_name, then use regexp_extract to get the partition column you want from the file name:

from pyspark.sql.functions import input_file_name, regexp_extract

# Pull the value between "hour=" and the next "/" out of each row's source file path
df2 = df1.withColumn("hour", regexp_extract(input_file_name(), "hour=(.+?)/", 1))
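Applied to the frame from the question, the same pattern could look like this; input_file_name() returns the full S3 object key for each row, so any partition in the path can be pulled out the same way (the extra day column is just an example):

from pyspark.sql.functions import input_file_name, regexp_extract

df1 = dynamicFrame.toDF()

# Each row's source path looks like .../year=2018/month=01/day=01/hour=05/<file>.json
df2 = (df1
       .withColumn('hour', regexp_extract(input_file_name(), 'hour=(.+?)/', 1))
       .withColumn('day', regexp_extract(input_file_name(), 'day=(.+?)/', 1)))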
0 votes

As I understand your problem, you would like to build a DataFrame for a given day with the hours as partitions. Generally, if you use Apache Hive-style partitioned paths and your files share the same schema, you should have no problem using

ds = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://bucket1/news/year=2018/month=01/day=01/']},
    'json')

or...

df = spark.read.option("mergeSchema", "true").json('s3://bucket1/news/year=2018/month=01/day=01/')

So if it doesn't work, check whether you are using Apache Hive-style partitioned paths and whether your files share the same schema.
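A quick way to sanity-check that, sketched with the plain Spark reader: if the hour=xx directories are picked up by partition discovery, hour shows up as an ordinary column.

df = spark.read.json('s3://bucket1/news/year=2018/month=01/day=01/')

# Partition discovery on the hour=xx subdirectories should add an 'hour' column
df.printSchema()
df.select('hour').distinct().show()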

You can also try the boto3 library in Glue (it may be useful to you):

import boto3
s3 = boto3.resource('s3')
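For example, a sketch of how boto3 could list which hour= partitions exist under a day prefix (bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')

# Treat '/' as a delimiter so each hour=xx/ prefix comes back as a CommonPrefix
resp = s3.list_objects_v2(
    Bucket='bucket1',
    Prefix='news/year=2018/month=01/day=01/',
    Delimiter='/')
hour_prefixes = [p['Prefix'] for p in resp.get('CommonPrefixes', [])]
print(hour_prefixes)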

Useful links:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

0 votes

"...AWS Glue does not include the partition columns in the DynamicFrame—it only includes the data."

We have to load the S3 key into a new column and decode the partitions programmatically to create the columns we want in the DynamicFrame/DataFrame. Once created, we can use them as needed.
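One way to do that decoding, sketched with Spark's input_file_name rather than any Glue-specific helper (and subject to the caveat in the PS below):

from pyspark.sql.functions import input_file_name, regexp_extract

# Attach the S3 key of each row's source file as a column
df = dynamicFrame.toDF().withColumn('s3_key', input_file_name())

# Decode each Hive-style key=value segment of the path into its own column
for part in ['year', 'month', 'day', 'hour']:
    df = df.withColumn(part, regexp_extract('s3_key', part + '=(.+?)/', 1))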

PS: I have tested this against Parquet files. It doesn't work for JSON files.
