2 votes

Given an application converting csv to parquet (from and to S3) with little transformation:

from pyspark.sql import functions as fn  # needed for to_date / lit below

for table in tables:
    # Read the raw CSV for this table from S3 (spark, tables, path, the date
    # variables and colrename() are defined elsewhere in the job).
    df_table = spark.read.format('csv') \
        .option("header", "true") \
        .option("escape", "\"") \
        .load(path)

    # Keep only the rows for the 1-day, 7-day and 30-day dates.
    df_one_seven_thirty_days = df_table \
        .filter(
            (df_table['date'] == fn.to_date(fn.lit(one_day)))
            | (df_table['date'] == fn.to_date(fn.lit(seven_days)))
            | (df_table['date'] == fn.to_date(fn.lit(thirty_days)))
        )

    # Normalise the column names before exposing the data as a temp view.
    for i in df_one_seven_thirty_days.schema.names:
        df_one_seven_thirty_days = df_one_seven_thirty_days.withColumnRenamed(i, colrename(i).lower())
    df_one_seven_thirty_days.createOrReplaceTempView(table)

    # Write the result back to S3 as a partitioned parquet table.
    df_sql = spark.sql("SELECT * FROM " + table)
    df_sql.write \
        .mode("overwrite").format('parquet') \
        .partitionBy("customer_id", "date") \
        .option("path", path) \
        .saveAsTable(adwords_table)

I'm facing a difficulty with Spark on EMR.

Locally with spark-submit, this runs without any trouble (140 MB of data) and quite fast. But on EMR, it's another story.

The first "adwords_table" is converted without problems, but the second one stays idle.

I've gone through the Spark jobs UI provided by EMR, and I noticed that once this task is done:

Listing leaf files and directories for 187 paths:

Spark kills all executors,

and 20 minutes later nothing more has happened. All the tasks are marked "Completed" and no new ones are starting, while I'm waiting for the saveAsTable to start.

My local machine has 8 cores and 15 GB of RAM. The cluster is made of 10 r3.4xlarge nodes: 32 vCores, 122 GiB memory, 320 GB SSD storage, EBS storage: 200 GiB.

The configuration uses maximizeResourceAllocation set to true, and I've only changed --num-executors / --executor-cores to 5.

Does anyone know why the cluster goes "idle" and doesn't finish the task? (It eventually crashes without errors 3 hours later.)

EDIT: I made some progress by removing all Glue catalog connections and downgrading Hadoop to use hadoop-aws:2.7.3.
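For reference, a minimal sketch of what the session setup looks like after that change; the app name and exact configs below are approximations, not my real ones:

from pyspark.sql import SparkSession

# Approximate session builder: pin the older S3A connector and simply do not
# set EMR's Glue Data Catalog metastore factory
# (com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory),
# so saveAsTable goes through the default Hive metastore instead.
spark = (
    SparkSession.builder
    .appName("csv_to_parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .enableHiveSupport()
    .getOrCreate()
)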

Now the saveAsTable works just fine, but once it finishes, I see the executors being removed and the cluster going idle; the step doesn't finish.

Thus my problem is still the same.

What is the exact S3 path you are trying to write to? - sramalingam24
path = "s3a://{0}/{1}/{2}".format(S3_DESTINATION_RAW_BUCKET, S3_PROCESSED_ADWORDS_PATH, adwords_table) - Jay Cee
Have you tried running them separately instead of a loop? - sramalingam24

2 Answers

0 votes

What I found out after many tries and headaches is that the cluster is still running / processing. It is actually trying to write the data, but only from the master node.

Surprisingly enough, this doesn't show up in the UI, which gives the impression of being idle.

The writing takes a few hours, no matter what I do (repartition(1), bigger cluster, etc.).

The main problem here is saveAsTable: I have no clue what it is doing that takes so long or makes the writing so slow.

Thus I went for write.parquet("hdfs:///tmp_loc") locally on the cluster, and then used aws s3-dist-cp to copy from HDFS to the S3 folder.

The performance is outstanding: I went from a saveAsTable taking 3 to 5 hours to write 17k rows / 120 MB, down to 3 minutes.
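Here is a rough sketch of what that looks like, reusing the df_sql DataFrame from the question; the bucket and folder names below are placeholders, not my real paths:

# Write the partitioned parquet locally to HDFS on the cluster.
df_sql.write \
    .mode("overwrite") \
    .partitionBy("customer_id", "date") \
    .parquet("hdfs:///tmp_loc")

# Then copy it to S3 with s3-dist-cp, run as an EMR step or from the master node:
#   s3-dist-cp --src hdfs:///tmp_loc --dest s3://my-bucket/processed/adwords_table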

As the data / schema might change at some point, I just execute a Glue save from a SQL request.
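Roughly, that comes down to a CREATE TABLE over the S3 location; the table and path names below are placeholders:

# Register the already-written parquet files as a table, so the catalog
# (Glue, in my case) picks up whatever schema the files currently have.
spark.sql("""
    CREATE TABLE IF NOT EXISTS adwords_table
    USING PARQUET
    LOCATION 's3a://my-bucket/processed/adwords_table'
""")

# Pick up the customer_id / date partitions that s3-dist-cp copied over.
spark.sql("MSCK REPAIR TABLE adwords_table")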

0 votes

I am also facing the same issue. Is it related to the new EMR 5.27 release? For me the job also gets stuck on one executor for a very long time: it completes 99% of the executors, and this happens while reading the files.