I have an application that converts CSV to Parquet (reading from and writing to S3) with very little transformation:
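from pyspark.sql import functions as fn

for table in tables:
    # Read the CSV input for this table from S3
    df_table = spark.read.format('csv') \
        .option("header", "true") \
        .option("escape", "\"") \
        .load(source_path)  # source_path: placeholder for the S3 input location (not shown here)
    # Keep only the rows dated one, seven, or thirty days back
    df_one_seven_thirty_days = df_table.filter(
        (df_table['date'] == fn.to_date(fn.lit(one_day)))
        | (df_table['date'] == fn.to_date(fn.lit(seven_days)))
        | (df_table['date'] == fn.to_date(fn.lit(thirty_days)))
    )
    # Normalize every column name to its renamed, lower-case form
    for i in df_one_seven_thirty_days.schema.names:
        df_one_seven_thirty_days = df_one_seven_thirty_days.withColumnRenamed(i, colrename(i).lower())
    # Register a temp view so the data can be queried by table name
    df_one_seven_thirty_days.createOrReplaceTempView(table)
    df_sql = spark.sql("SELECT * FROM " + table)
    # Write the result back to S3 as partitioned Parquet
    df_sql.write \
        .mode("overwrite").format('parquet') \
        .partitionBy("customer_id", "date") \
        .option("path", path) \
        .saveAsTable(table)  # target table name assumed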
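The snippet does not show how `one_day`, `seven_days`, and `thirty_days` are defined. A minimal sketch, assuming they are ISO date strings counted back from the run date (the helper name `days_ago` is hypothetical and not part of the original job):

```python
from datetime import date, timedelta


def days_ago(n, today=None):
    """Return the ISO date string for n days before `today` (defaults to the current date)."""
    base = today or date.today()
    return (base - timedelta(days=n)).isoformat()


# Assumed definitions of the three cut-off dates used in the filter
one_day = days_ago(1)
seven_days = days_ago(7)
thirty_days = days_ago(30)
```

Strings in this form can be passed to `fn.lit(...)` and converted with `fn.to_date(...)` as in the filter above.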
I'm having trouble running this with Spark on EMR.
Locally, with spark-submit, it runs without any problem (140 MB of data) and quite fast. On EMR, it's another story.
The first table, "adwords_table", is converted without problems, but the second one stays idle.
I've gone through the Spark jobs UI provided by EMR and noticed that once this job is done:
Listing leaf files and directories for 187 paths:
nothing more happens, even 20 minutes later. All the tasks are marked "Completed" and no new ones start. I'm still waiting for the saveAsTable to begin.
My local machine has 8 cores and 15 GB of RAM; the cluster is made of 10 r3.4xlarge nodes (32 vCores, 122 GiB memory, 320 GB SSD storage, 200 GiB EBS storage).
The configuration sets maximizeResourceAllocation to true, and I've only changed --num-executors / --executor-cores to 5.
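As a rough sanity check on those settings, the arithmetic below compares what is requested with what the cluster could host. It is a back-of-the-envelope sketch: it assumes the hardware figures are per node, that all 10 nodes run executors, and that both flags are set to 5, and it ignores YARN, OS, and driver overhead, so real capacity will be lower.

```python
# Cluster figures from the description above (assumed to be per node)
nodes = 10
vcores_per_node = 32
mem_gib_per_node = 122

# Assumption: both --num-executors and --executor-cores are set to 5
num_executors = 5
executor_cores = 5

total_vcores = nodes * vcores_per_node                   # vCores across the cluster
total_mem_gib = nodes * mem_gib_per_node                 # memory across the cluster
executors_per_node = vcores_per_node // executor_cores   # 5-core executors that fit on one node
max_executors = nodes * executors_per_node               # upper bound, ignoring overhead
requested_cores = num_executors * executor_cores         # cores actually requested

print(f"cluster: {total_vcores} vCores, {total_mem_gib} GiB")
print(f"upper bound on 5-core executors: {max_executors}")
print(f"requested: {requested_cores} cores ({requested_cores / total_vcores:.1%} of the cluster)")
```

Under these assumptions the job asks for only a small fraction of the available cores, which matters for throughput but would not by itself explain a step that never finishes.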
Does anyone know why the cluster goes "idle" and doesn't finish the task? (It eventually crashes without errors 3 hours later.)
EDIT: I made some progress by removing all Glue catalog connections and downgrading Hadoop to use hadoop-aws:2.7.3.
Now the saveAsTable works just fine, but once it finishes, I see the executors being removed and the cluster sits idle; the step never finishes.
So my problem is still the same.