We have the following requirement:
Yearly XML files (15-20 GB each) covering 1990 to 2018+
Weekly XML files (3-6 GB each) that contain updated XML records for any of the yearly data from 1990 to 2018+
We need to run an ETL job that merges the weekly data into the yearly data in S3, and expose the integrated data to downstream on-premise applications as an API
The path we are taking is AWS Glue for the ETL merge and potentially Athena for providing SQL query results to downstream applications
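For context on the downstream side, here is a minimal sketch of how an on-premise API could pull query results from Athena via boto3. The database name, table name, and result-bucket location are placeholders, not part of our actual setup:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql, database="merged_catalog", output="s3://my-athena-results/"):
    """Submit a query, poll until it finishes, and return the result rows."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError("Athena query %s ended in state %s" % (qid, state))
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

# hypothetical table/columns, for illustration only
rows = run_query("SELECT record_id, year FROM merged_records LIMIT 10")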
I am trying to merge (insert/update) a few XML files in S3 using AWS Glue with PySpark. To be precise, I am doing the following steps:
1) Create DynamicFrames from the Glue catalog for the XML source data (3 files totaling around 10 GB)
Then, for each year of data: 2) Convert the DynamicFrame to a Spark DataFrame. 3) Use a leftanti join and a union to merge the data. 4) Write out JSON files, combined into a single file per year. A rough sketch of the logic is included below.
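A simplified version of what the job does, with placeholder database/table names, an assumed key column "record_id", and a placeholder output path. Note that coalesce(1), which is what produces the single file per year, funnels the whole write through one executor and may itself be a large part of the runtime:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# 1) DynamicFrames from the Glue catalog (yearly baseline + weekly updates)
yearly_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="xml_catalog", table_name="yearly_2017")
weekly_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="xml_catalog", table_name="weekly_updates")

# 2) Convert DynamicFrames to Spark DataFrames
yearly_df = yearly_dyf.toDF()
weekly_df = weekly_dyf.toDF()

# 3) leftanti join drops yearly rows that have a replacement in the weekly file,
#    then union appends the new/updated weekly records
merged_df = (yearly_df.join(weekly_df, on="record_id", how="leftanti")
                      .union(weekly_df.select(yearly_df.columns)))

# 4) Write out JSON; coalesce(1) combines the output into a single file per year
merged_df.coalesce(1).write.mode("overwrite").json(
    "s3://my-output-bucket/merged/2017/")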
According to the logs, the job runs for about 2 hours, which I think is too long. We need to scale to 400+ GB of data, and I am not sure I am coding this the right way.
a) I notice that the following statement appeared for almost the entire 2 hours:
18/02/24 22:46:56 INFO Client: Application report for application_1519512218639_0001 (state: RUNNING)
18/02/24 22:46:56 DEBUG Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.31.58.135
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1519512404681
     final status: UNDEFINED
     tracking URL: http://ip-172-31-62-4.ec2.internal:20888/proxy/application_1519512218639_0001/
     user: root
Does this mean the job is actually running and just taking time to process, or is it waiting for resources?
b) In my code, I also tried using .cache(), assuming this would keep the intermediate data in memory, but it had no effect on the time taken. A minimal illustration of what I understand about caching is below.
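As far as I understand, .cache() is lazy: it only marks the DataFrame for caching, and Spark materializes it when the first action runs. If each DataFrame feeds exactly one join/union/write, there is nothing to reuse, so caching adds memory pressure without saving recomputation. A sketch reusing the placeholder names from the job sketch above ("source_year" is also a placeholder column):

# .cache() is lazy; nothing is stored until an action runs
yearly_df = yearly_dyf.toDF().cache()
yearly_df.count()   # an action that forces the cached data to materialize

# caching only pays off when the same DataFrame is consumed more than once,
# e.g. by the merge AND a separate aggregation:
merged_df = (yearly_df.join(weekly_df, on="record_id", how="leftanti")
                      .union(weekly_df.select(yearly_df.columns)))
counts_by_year = yearly_df.groupBy("source_year").count()   # second consumer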
c) The logs indicate that the data is copied over to the HDFS nodes and the tasks are performed there. How different is this from executing similar code on EMR?
d) Any other suggestions to consider for improving the performance?