
We have the following requirement:

Yearly XML files (15-20 GB each) covering 1990 to 2018+

Weekly XML files (3-6 GB each) that contain updated XML records for any of the yearly data from 1990 to 2018+

We need to run an ETL job that merges the weekly data into the yearly data in S3, and expose the integrated data to downstream on-premises applications as an API.

The path we are taking is AWS Glue for the ETL merge, and potentially Athena for providing SQL query results to downstream applications.

I am trying to ETL-merge a few XML files (insert/update) in S3 using AWS Glue with PySpark. To be precise, I am doing the following steps (a code sketch follows the list):

1) Create DynamicFrames from the Glue Data Catalog (for the XML source data) [there are 3 files totaling around 10 GB]

Then, for each year of data:

2) Convert the DynamicFrame to a Spark DataFrame

3) Use join (leftanti) and union to merge the data

4) Write out JSON files (combined into a single file per year)
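Here is a minimal sketch of those steps. The catalog names (xml_db, yearly_xml, weekly_xml), the join key record_id, and the S3 output path are placeholders, not our actual schema:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1) DynamicFrames from the Glue Data Catalog (names are placeholders)
yearly_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="xml_db", table_name="yearly_xml")
weekly_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="xml_db", table_name="weekly_xml")

# 2) Convert DynamicFrames to Spark DataFrames
yearly_df = yearly_dyf.toDF()
weekly_df = weekly_dyf.toDF()

# 3) The left_anti join keeps only yearly rows with no weekly update;
#    union then appends the new/updated weekly rows
merged_df = (yearly_df
             .join(weekly_df, on="record_id", how="left_anti")
             .union(weekly_df))

# 4) coalesce(1) produces a single JSON file per year, but it funnels the
#    whole year's data through one task and can be a performance bottleneck
merged_df.coalesce(1).write.mode("overwrite").json(
    "s3://my-bucket/merged/year=1990/")

job.commit()
```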

The job runs for 2 hours according to the logs, which I think is too long. We need to scale to processing 400+ GB of files, and I am not sure I am coding it the right way.

a)

I notice that the following statement appears for the majority of the 2 hours:

18/02/24 22:46:56 INFO Client: Application report for application_1519512218639_0001 (state: RUNNING)
18/02/24 22:46:56 DEBUG Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.31.58.135
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1519512404681
     final status: UNDEFINED
     tracking URL: http://ip-172-31-62-4.ec2.internal:20888/proxy/application_1519512218639_0001/
     user: root

Does this mean that the job is actually running and just taking time to process, or is it waiting for resources?

b) In my code, I also tried using .cache(), assuming this would execute the logic in memory. But it has no effect on the time taken.
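For reference, a minimal sketch of how .cache() interacts with Spark's lazy evaluation (merged_df is the hypothetical DataFrame from the sketch above):

```python
# .cache() only marks the DataFrame for caching; Spark is lazy, so the data
# is materialized in memory the first time an action runs against it.
merged_df.cache()
merged_df.count()  # first action: computes the plan and populates the cache
# Subsequent actions reuse the cached data instead of recomputing the joins.
merged_df.write.mode("overwrite").json("s3://my-bucket/merged/year=1990/")
```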

c) The logs indicate that the data is copied over to the HDFS nodes and the tasks are performed there. How different is this from executing similar code on EMR?

d) Any other suggestions to consider for improving the performance?

Have you tried setting up a development endpoint and checking the Spark UI? It can show you what is happening with your job. – Natalia
Thanks for the suggestion - I will try it and update! – Sundar
@Natalia, how can I view the Spark UI on a dev endpoint? – human

1 Answer


Did you try increasing the number of DPUs? What is your current DPU configuration? Please refer to aws.amazon.com/glue/faqs for guidance on scaling and improving the performance of AWS Glue ETL jobs; it includes suggestions on increasing the DPUs. Please note there are limits on the number of DPUs that can be used for ETL jobs and dev endpoints, beyond which you need to contact AWS Support.
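A hypothetical example of re-running a Glue job with more DPUs via boto3 (the job name is a placeholder; AllocatedCapacity is the number of DPUs assigned to the run):

```python
import boto3

glue = boto3.client("glue")

# Re-run the job with 20 DPUs instead of the default 10.
# "my-xml-merge-job" is a placeholder for your actual Glue job name.
response = glue.start_job_run(
    JobName="my-xml-merge-job",
    AllocatedCapacity=20,
)
print(response["JobRunId"])
```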