I'm reading 8 tables from Aurora PostgreSQL with PySpark in an AWS Glue job, and after transformations and joins I write to a single Redshift table of roughly 2-5 GB. The source table sizes are:
92 GB, 20 GB, 68 MB, 50 MB, 8 MB, 7 MB, 6 MB, 1.5 MB, 88 KB, 56 KB.
Number of Standard worker nodes: 10; job concurrency between 1 and 3 (in case that helps).
For the two big tables I apply filtering while fetching from Postgres. I'm also trying to apply the Kryo serializer to the Glue job — will this help? If yes, how can I apply and verify it?
billing_fee_df = glueContext.read.format("jdbc")\
    .option("driver", "org.postgresql.Driver")\
    .option("url", "jdbc:postgresql://prod-re.cbn6.us-west-2.rds.amazonaws.com:5432/db")\
    .option("dbtable", "(SELECT <specific columns> FROM first_largest_table cc LEFT JOIN second_largest_table cf ON cc.id = cf.id LEFT JOIN thirdTable con ON cf.id = con.id WHERE cc.maturity_date > ((current_date - 45)::timestamp without time zone AT TIME ZONE 'UTC')) AS bs")\
    .option("user", "postgres")\
    .option("password", "pass")\
    .option("numPartitions", "100")\
    .option("partitionColumn", "date_col")\
    .option("lowerBound", "2020-07-30 00:00:00.000")\
    .option("upperBound", "2020-12-11 00:00:00.000")\
    .load()
Below are the optimizations I'm already implementing:
- broadcasting all the smaller tables;
- column pruning.
The job currently finishes in about 20 minutes. I'm looking for suggestions on how to make it finish in less time while keeping the cost aspect in mind.
Any suggestions and questions are appreciated.