
I have created a table in the AWS Glue Catalog pointing to an S3 location, and I am using an AWS Glue ETL job to read any new files in that location. A data file has its first record as the header. However, sometimes an empty file is dropped in S3 with no data and no header. Since the file has no header information either, my ETL job fails with 'Cannot resolve given input columns'.

My question - Is there a way to NOT read the schema from the file headers, but take it from the AWS Glue Catalog instead? I have already defined the schema in the catalog. I would still want to skip the first line of each data file while reading, just not treat it as the header.

Below is the code I am trying -

from pyspark.sql.functions import col

# Read the table defined in the Glue Catalog and convert it to a Spark DataFrame
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "test", transformation_ctx = "datasource1")
datasource1DF = datasource1.toDF()
datasource1DF.select(col('updatedtimestamppdt')).show()

Error -

Fail to execute line 1: datasource1DF.orderBy(col('updatedtimestamppdt'), ascending=False).select(col('updatedtimestamppdt')).distinct().show()
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1616.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'updatedtimestamppdt' given input columns: [];;

1 Answer


Have you tried the following (as long as you've checked the box that exposes your Glue Catalog as a Hive metastore)?

df = spark.sql('select * from yourgluedatabase.yourgluetable')
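
As a follow-up, here is a minimal sketch of how that could look, assuming the catalog table is yourgluedatabase.yourgluetable and contains a string column updatedtimestamppdt (names taken from the question, so adjust to your own). Because the schema comes from the Glue Catalog rather than the file headers, an empty file simply contributes no rows; if the header line of each data file is still read in as an ordinary row, one way to skip it is to filter on the known header value.

# Sketch only: schema is taken from the Glue Catalog via the Hive metastore,
# so empty files produce zero rows instead of a schema-resolution error.
from pyspark.sql.functions import col

df = spark.sql('select * from yourgluedatabase.yourgluetable')

# If header lines still appear as data rows, drop them by matching the
# known header text (hypothetical column name and value).
df = df.filter(col('updatedtimestamppdt') != 'updatedtimestamppdt')

df.orderBy(col('updatedtimestamppdt'), ascending=False) \
  .select(col('updatedtimestamppdt')) \
  .distinct() \
  .show()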