I am trying to run a simple ETL process using AWS Glue.
The process is simple: use a JDBC connector to read 20+ tables from a database, and then sink them into S3. Everything works fine; the only issue is the amount of time it takes to run the job (2+ hours).
The main bottleneck is caused by some very large tables (16 to 20 million records), and by the fact that I have to extract the number of rows and the list of fields for each of them. The Glue job uses Python 3, Spark 3, and 2 workers (one of which is the driver).
I first read the table:
df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", table) \
    .option("driver", DRIVER) \
    .load()
Then I convert it to a Glue DynamicFrame (as it is easier for me to run operations on it):
df = DynamicFrame.fromDF(df, glueContext, "df")
Then I proceed to calculate the number of rows:
n_rows = df.count()
This is where the pain starts: for some tables (the biggest ones) it takes 10 to 20 minutes just to return this value. I have done some research and (I think) I understand lazy evaluation and how computations work in Spark, but it still seems to me that this operation should take far less time, so I am surely doing something wrong.
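One idea I had, but have not tested, is to push the count down to the database by passing a subquery as the dbtable option, so that only a single row comes back over JDBC. A rough sketch (the derived-table syntax and the alias t are assumptions that depend on the source database):
# untested sketch: let the database do the counting instead of pulling the whole table
count_df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", f"(SELECT COUNT(*) AS n FROM {table}) AS t") \
    .option("driver", DRIVER) \
    .load()
n_rows = count_df.collect()[0][0]  # single row, single column
I am not sure whether that is the right approach in a Glue job, so comments on it are welcome too.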
Anyway, then I proceed to generate a field list:
fields = [df.schema().fields[x].name for x in range(0, len(df.schema().fields))]
Again, this takes 10 to 20 minutes to run.
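Here as well I wondered whether I could take the field names from the Spark DataFrame returned by the JDBC read instead, since, as far as I understand, its schema comes from the database metadata without scanning any rows. A sketch, where spark_df is just a name I made up for the DataFrame before the fromDF conversion:
# sketch: read the column names from the JDBC DataFrame's schema (metadata only),
# before it is converted with DynamicFrame.fromDF; spark_df is a hypothetical name
fields = [field.name for field in spark_df.schema.fields]
Would that avoid the second 10 to 20 minute wait, or does schema() on the DynamicFrame not actually scan the data?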
Eventually, I sink the DynamicFrame:
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": path, "partitionKeys": [partition]},
    format="parquet")
Again, this write takes a long time for these large tables.
It is worth mentioning that I also extract from tables that contain only a few rows. I mention this because a possible solution I have read about is to repartition as soon as I read the table, but it would make zero sense to repartition a DataFrame of 3 rows.
The only way to do that systematically would be to count the rows first and then repartition based on n_rows, but the count already takes forever. I have also read that the number of partitions should be somewhat related to the number of workers; I have 1 worker, so 1 partition seems logical to me.
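On that point, I am also not sure whether "repartition as soon as I read the table" means calling .repartition() right after .load(), or using the JDBC reader's own partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) so that Spark opens several connections and reads in parallel. A sketch of the latter, where the column name id and the bounds are assumptions about the table:
# sketch: parallel JDBC read, assuming a numeric column "id" whose min/max are known
df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", table) \
    .option("driver", DRIVER) \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "20000000") \
    .option("numPartitions", "8") \
    .load()
But I do not know how to choose numPartitions sensibly when only 1 worker is doing the work, which is part of my question.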
My question is: what am I doing wrong? Should I just increase the number of workers and repartition accordingly at the moment of reading? Or are there other solutions available? Thanks a lot for any advice!