3 votes

I have an ETL job in AWS Glue that is triggered by a scheduler. My ETL language is Python. I am trying to write the result of a query to an S3 bucket, and for this I use Spark SQL. The job fails when it is triggered by the scheduler, but it succeeds when I run it manually. It throws an error for a column (eventdate) that is present in the Spark DataFrame.
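
For reference, here is a minimal sketch of the kind of Glue job described; the catalog database my_db, table my_table, the query itself, and the output path s3://my-bucket/output/ are all hypothetical placeholders, not taken from the actual job:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Read the source table through the Glue Data Catalog
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database='my_db', table_name='my_table')
    df = dyf.toDF()
    df.createOrReplaceTempView('events')

    # Run the report query; 'eventdate' must be resolvable in the view
    sql_query = "SELECT eventdate, count(*) AS cnt FROM events GROUP BY eventdate"
    error_report_result_df = spark.sql(sql_query)

    # Write the result to S3; job.commit() advances the job bookmark
    error_report_result_df.write.mode('overwrite').parquet('s3://my-bucket/output/')
    job.commit()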

Below is the log.

Traceback (most recent call last):
File "script_2018-06-22-11-10-05.py", line 48, in <module>
error_report_result_df = spark.sql(sql_query)
File "/mnt/yarn/usercache/root/appcache/application_1529665635815_0001/container_1529665635815_0001_01_000001/pyspark.zip/pyspark/sql/session.py", line 603, in sql
File "/mnt/yarn/usercache/root/appcache/application_1529665635815_0001/container_1529665635815_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1529665635815_0001/container_1529665635815_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`eventdate`' given input columns: []; line 1 pos 480"
Can you also paste your code? Is your code parametric? Are you passing the same parameters? – botchniaque

1 Answer

5 votes

This is happening because of the Job Bookmark. I had enabled the Job Bookmark [1] in my trigger definition; note that this is also the default choice when you create a trigger. In that case, when the glueContext is called and sees that there is no new data to process, it returns an empty DataFrame, and Spark cannot infer any schema from it. This explains why the table registered on that DataFrame does not have any fields. The same does not happen when the script is launched from the web console, because there the Job Bookmark is disabled by default. When I disabled the bookmark, the job worked.
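
If the bookmark has to stay enabled, one option is to guard against the empty input before running the query. This is only a sketch, not part of the original fix, and it assumes the source was read into df and the query string is in sql_query as in the snippet above:

    df = dyf.toDF()
    if not df.schema.fields:
        # The bookmark filtered out all input: the empty DataFrame has no
        # schema, so any SQL referencing its columns cannot be resolved.
        print("No new data to process; skipping the query.")
    else:
        df.createOrReplaceTempView('events')
        error_report_result_df = spark.sql(sql_query)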

[1] Job Bookmarks https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html