I'm reading 8 tables from Aurora PostgreSQL with PySpark in an AWS Glue job, and after transformations and joins I write to a single Redshift table of roughly 2-5 GB. The source table sizes are:
92 GB, 20 GB, 68 MB, 50 MB, 8 MB, 7 MB, 6 MB, 1.5 MB, 88 KB, 56 KB.
Number of Standard worker nodes: 10; job concurrency between 1 and 3 (in case that helps).
For the two big tables I apply filtering while fetching from Postgres. I'm also trying to apply the Kryo serializer to the Glue job — will this help? If yes, how can I apply and verify it?
billing_fee_df = glueContext.read.format("jdbc")\
    .option("driver", "org.postgresql.Driver")\
    .option("url", "jdbc:postgresql://prod-re.cbn6.us-west-2.rds.amazonaws.com:5432/db")\
    .option("dbtable", "(SELECT <specific columns> FROM first_largest_table cc LEFT JOIN second_largest_table cf ON cc.id = cf.id LEFT JOIN thirdTable con ON cf.id = con.id WHERE cc.maturity_date > ((current_date - 45)::timestamp without time zone AT TIME ZONE 'UTC')) AS bs")\
    .option("user", "postgres")\
    .option("password", "pass")\
    .option("numPartitions", "100")\
    .option("partitionColumn", "date_col")\
    .option("lowerBound", "2020-07-30 00:00:00.000")\
    .option("upperBound", "2020-12-11 00:00:00.000")\
    .load()
Below are the optimizations I'm already implementing:
- broadcasting all the smaller tables;
- column pruning.
The job currently finishes in about 20 minutes. I'm looking for suggestions on how to make it finish in less time while keeping the cost aspect in mind.
Any suggestions and questions are appreciated.