
I am trying to run a simple ETL process using AWS Glue.

The process is simple: use a JDBC connector to read 20+ tables from a database, then sink them in S3. Everything works fine; the only issue is the amount of time the job takes to run (2+ hours).

The main bottleneck is caused by some very large tables (16 to 20 million records), and by the fact that I have to extract the number of rows and the list of fields for each table. The Glue job uses Python 3, Spark 3, and 2 workers (of which 1 is the driver).

I first read the table:

df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", table) \
    .option("driver", DRIVER) \
    .load()

Then I convert it to a Glue DynamicFrame (as it is easier for me to run operations on it):

df = DynamicFrame.fromDF(df, glueContext, "df")

Then I proceed to calculate number of rows:

n_rows = df.count()

Which is where the pain starts: for some tables (the biggest ones) it takes 10 to 20 minutes just to return this value. I have researched and (I think) understand the concept of lazy evaluation and computation in Spark, but it still seems to me that this operation should take far less time, so I am surely doing something wrong.
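
The only cheaper alternative I could think of is pushing the count down to the database itself. This is just a rough sketch of what I mean (assuming my JDBC source accepts an aliased subquery as dbtable, which I have not verified; the aliases n and cnt are placeholders):

# Sketch only: let the database do the counting instead of Spark.
# Assumes the source accepts an aliased subquery as dbtable.
count_df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", "(SELECT COUNT(*) AS n FROM " + table + ") AS cnt") \
    .option("driver", DRIVER) \
    .load()
n_rows = count_df.collect()[0][0]

I am not sure that is the right way to go, though. Anyway, I then proceed to generate a field list: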

fields = [field.name for field in df.schema().fields]

Which, again, takes 10 to 20 minutes to run. Finally, I sink the DynamicFrame:

glueContext.write_dynamic_frame.\
    from_options(frame = df,
                 connection_type = "s3",
                 connection_options = {"path": path,
                                       "partitionKeys": [partition]},
                 format = "parquet")

Which, again, takes a long time for these large tables.

It is worth mentioning that I also extract from database tables that contain only a few rows. I mention this because one possible solution I have read about is to repartition as soon as I read the table (something like the sketch below), but it would make zero sense to repartition a DataFrame of 3 rows.
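
For reference, this is roughly what that suggestion looks like; just a sketch, where num_partitions is a placeholder I would have to choose (perhaps a small multiple of the number of executor cores):

# Sketch of "repartition right after the read"; num_partitions is a placeholder.
num_partitions = 8
df = sparkSession.read.format("jdbc") \
    .option("url", connection_url) \
    .option("dbtable", table) \
    .option("driver", DRIVER) \
    .load() \
    .repartition(num_partitions)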

The only way of doing this systematically would be to count the rows first and then repartition based on n_rows (sketched below), but the count itself already takes forever. Also, I have read that the number of partitions should be somewhat related to the number of workers. I have 1 worker, so 1 partition seems logical to me.
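
To make that idea concrete, this is the kind of logic I have in mind (a sketch only, applied to the Spark DataFrame right after the JDBC load, before converting to a DynamicFrame; rows_per_partition is an arbitrary target I would have to tune):

# Sketch of "count first, then repartition based on n_rows".
# rows_per_partition is an arbitrary placeholder target.
rows_per_partition = 1000000
n_rows = df.count()  # this full count is exactly what already takes 10-20 minutes
num_partitions = max(1, n_rows // rows_per_partition)
df = df.repartition(num_partitions)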

My question would be: what am I doing wrong? Should I just increase the number of workers and repartition accordingly at the moment of reading? Or what other solutions are available? Thanks a lot for any advice!