I'm having issues reading data with an AWS Glue job in PySpark:
Data is sent from an AWS Kinesis Firehose (sample data) to an S3 bucket, stored as JSON and compressed with Hadoop-compatible Snappy (snappy-hadoop).
I can read the data into a plain Spark DataFrame with spark.read.json(), but the same data won't load into a Glue DynamicFrame (the schema is not parsed at all), whether I use the from_catalog or the from_options method (both shown below):
Spark DataFrame
# read the files directly with Spark
spark_df = spark.read.json("s3://my-bucket/sample-json-hadoop-snappy/")
spark_df.printSchema()
- result:
root
|-- change: double (nullable = true)
|-- price: double (nullable = true)
|-- sector: string (nullable = true)
|-- ticker_symbol: string (nullable = true)
|-- year: integer (nullable = true)
|-- dt: date (nullable = true)
Glue DynamicFrame (from_options)
# read from S3 via Glue from_options
options_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/sample-json-hadoop-snappy/"]},
    format="json"
)
options_df.printSchema()
- result:
root
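Glue DynamicFrame (from_catalog)
The from_catalog variant yields the same empty schema. A minimal sketch of that call, where "my_db" and "sample_json" are placeholders for the crawler-created database and table names:
# read the same S3 data via the Glue Data Catalog (placeholder names)
catalog_df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="sample_json"
)
catalog_df.printSchema()
- result:
root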