2 votes

Given an application converting csv to parquet (from and to S3) with little transformation:

from pyspark.sql import functions as fn  # needed for to_date / lit below

for table in tables:
    # Read the raw CSV for this table from S3 (spark, tables, path, the date
    # variables and colrename() are defined elsewhere in the job).
    df_table = spark.read.format('csv') \
        .option("header", "true") \
        .option("escape", "\"") \
        .load(path)

    # Keep only the rows for the 1-day, 7-day and 30-day dates.
    df_one_seven_thirty_days = df_table \
        .filter(
            (df_table['date'] == fn.to_date(fn.lit(one_day)))
            | (df_table['date'] == fn.to_date(fn.lit(seven_days)))
            | (df_table['date'] == fn.to_date(fn.lit(thirty_days)))
        )

    # Normalise the column names before exposing the data as a temp view.
    for i in df_one_seven_thirty_days.schema.names:
        df_one_seven_thirty_days = df_one_seven_thirty_days.withColumnRenamed(i, colrename(i).lower())
    df_one_seven_thirty_days.createOrReplaceTempView(table)

    # Write the result back to S3 as a partitioned parquet table.
    df_sql = spark.sql("SELECT * FROM " + table)
    df_sql.write \
        .mode("overwrite").format('parquet') \
        .partitionBy("customer_id", "date") \
        .option("path", path) \
        .saveAsTable(adwords_table)

I'm facing a difficulty with Spark on EMR.

Locally with spark-submit, this runs without any trouble (140 MB of data) and quite fast. But on EMR, it's another story.

The first "adwords_table" is converted without problems, but the second one stays idle.

I've gone through the Spark jobs UI provided by EMR, and I noticed that once this task is done:

Listing leaf files and directories for 187 paths:

Spark kills all executors,

and 20 minutes later nothing more has happened. All the tasks are marked "Completed" and no new ones are starting, while I'm waiting for the saveAsTable to start.

My local machine has 8 cores and 15 GB of RAM. The cluster is made of 10 r3.4xlarge nodes: 32 vCores, 122 GiB memory, 320 GB SSD storage, EBS storage: 200 GiB.

The configuration uses maximizeResourceAllocation set to true, and I've only changed --num-executors / --executor-cores to 5.

Does anyone know why the cluster goes "idle" and doesn't finish the task? (It eventually crashes without errors 3 hours later.)

EDIT: I made some progress by removing all Glue catalog connections and downgrading Hadoop to use hadoop-aws:2.7.3.
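For reference, a minimal sketch of what the session setup looks like after that change; the app name and exact configs below are approximations, not my real ones:

from pyspark.sql import SparkSession

# Approximate session builder: pin the older S3A connector and simply do not
# set EMR's Glue Data Catalog metastore factory
# (com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory),
# so saveAsTable goes through the default Hive metastore instead.
spark = (
    SparkSession.builder
    .appName("csv_to_parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .enableHiveSupport()
    .getOrCreate()
)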

Now the saveAsTable works just fine, but once it finishes, I see the executors being removed and the cluster going idle; the step doesn't finish.

Thus my problem is still the same.

What is the exact S3 path you are trying to write to? - sramalingam24
path = "s3a://{0}/{1}/{2}".format(S3_DESTINATION_RAW_BUCKET, S3_PROCESSED_ADWORDS_PATH, adwords_table) - Jay Cee
Have you tried running them separately instead of a loop? - sramalingam24

2 Answers

0 votes

What I found out after many tries and headaches is that the cluster is still running / processing. It is actually trying to write the data, but only from the master node.

Surprisingly enough, this doesn't show up in the UI, which gives the impression of being idle.

The writing takes a few hours, no matter what I do (repartition(1), bigger cluster, etc.).

The main problem here is saveAsTable: I have no clue what it is doing that takes so long or makes the writing so slow.

Thus I went for write.parquet("hdfs:///tmp_loc") locally on the cluster, and then used aws s3-dist-cp to copy from HDFS to the S3 folder.

The performance is outstanding: I went from a saveAsTable taking 3 to 5 hours to write 17k rows / 120 MB, down to 3 minutes.
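Here is a rough sketch of what that looks like, reusing the df_sql DataFrame from the question; the bucket and folder names below are placeholders, not my real paths:

# Write the partitioned parquet locally to HDFS on the cluster.
df_sql.write \
    .mode("overwrite") \
    .partitionBy("customer_id", "date") \
    .parquet("hdfs:///tmp_loc")

# Then copy it to S3 with s3-dist-cp, run as an EMR step or from the master node:
#   s3-dist-cp --src hdfs:///tmp_loc --dest s3://my-bucket/processed/adwords_table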

As the data / schema might change at some point, I just execute a Glue save from a SQL request.
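Roughly, that comes down to a CREATE TABLE over the S3 location; the table and path names below are placeholders:

# Register the already-written parquet files as a table, so the catalog
# (Glue, in my case) picks up whatever schema the files currently have.
spark.sql("""
    CREATE TABLE IF NOT EXISTS adwords_table
    USING PARQUET
    LOCATION 's3a://my-bucket/processed/adwords_table'
""")

# Pick up the customer_id / date partitions that s3-dist-cp copied over.
spark.sql("MSCK REPAIR TABLE adwords_table")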

0 votes

I am also facing the same issue. Is it related to the new EMR 5.27 release? For me the job also gets stuck on one executor for a very long time: it completes 99% of the executors, and this happens while reading the files.