Objective
I have an S3 folder full of json files with varying schemas, including arrays (a dynamodb backup, as it happens). While the schemas vary, all files contain some common elements, such as 'id' or 'name', as well as nested arrays of varying lengths, such as 'selected items'. I want to be able to parse out these elements at my leisure.
I have a working approach using an external ETL tool (KNIME), which I'm looking to replicate in a serverless manner via Glue.
Background
The working approach is:
- Load all the S3 data as an external table, using Spectrum, with each json record as a single `varchar(65535)` entry.
- Parse out the elements I need, including specific arrays, using Redshift SQL functions such as `json_extract_path_text` and `json_array_length`.
- Normalise required json arrays by cross joining against a reference table of array indexes.
- Perform a required table join and write out to Redshift for consumption by Tableau.
Now, this seems like a task that should be a good fit for Glue. Namely, what I want to do is either:

- Load all the data using Spectrum, as above
- In Glue, create a dynamic frame from the Spectrum table
- Parse data from the dynamic frame as above, using functions such as `pyspark.sql.functions.explode()`, or maybe Glue's Relationalize transform (sketched below)

Or:

- Crawl all scalar data into a single Glue schema (given that Glue doesn't yet support JSON arrays)
- Parse JSON and explode arrays using one of the approaches above
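
For concreteness, the kind of transform I have in mind for the explode route is sketched below. This is a minimal, self-contained pySpark sketch rather than anything I have working end to end: the two inline records and the field names (`id`, `name`, `selected_items`) are made-up stand-ins for my real data, and in a Glue job the input would come from the catalog table rather than from `createDataFrame`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Stand-in for the 'single varchar(65535) column' table: one raw json string per row.
raw = spark.createDataFrame(
    [
        ('{"id": "1", "name": "foo", "selected_items": ["a", "b"]}',),
        ('{"id": "2", "name": "bar", "selected_items": ["c"]}',),
    ],
    ["json_blob"],
)

# Declare only the fields I care about; any extra fields a given record happens
# to carry are simply ignored by from_json.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("selected_items", ArrayType(StringType())),
])

parsed = raw.select(F.from_json("json_blob", schema).alias("j")).select("j.*")

# Normalise the array: one output row per array element - the equivalent of the
# cross-join-against-an-index-table step in the Redshift version.
flat = parsed.select("id", "name", F.explode("selected_items").alias("selected_item"))
flat.show()
```

Relationalize would presumably achieve something similar in one shot (`Relationalize.apply(frame=dyf, staging_path=..., name="root")`, where `dyf` is the dynamic frame), but it needs a dynamic frame containing usable data to start from, which is exactly where I get stuck below.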
Results so far
Unfortunately, I haven’t been able to get either of these approaches to work. The blockers for each have been:
- Crawl json data using Glue - as per this post, Glue’s parser heuristics decide that the various schemas of the source are too different to relate to a single source and so parse them out as a bunch of different tables. It would be enough to have a crawler that simply crawled each file to produce a table with a single column of type `varchar(65535)` containing a json entry per row (see the sketch after this list for the shape I mean), but there doesn’t seem to be any Glue classifier JSON path expression that will achieve this.
- Create a dynamic frame from a Spectrum table - the ‘single `varchar(65535)` column’ approach can be achieved by loading the data as an external table in Spectrum; however, it seems that Spectrum tables can’t be loaded into Glue as dynamic frames (note that the relevant table exists in the Glue catalog, where it is shown as having the expected `varchar(65535)` schema). Working in Zeppelin Notebooks, I’ve found that

  ```python
  newFrame = glueContext.create_dynamic_frame.from_catalog(database="<spectrum database>", table_name="<spectrum one column varchar table>")
  ```

  runs successfully, however `newFrame.count()` returns 0, and `newFrame.toDF().show(n)`, for any value of n, produces an odd output of the form:

  ```
  ++
  ||
  ++
  ++
  ```

  In short, it appears that pySpark can’t work directly with Spectrum tables through Glue.
- Crawl a Spectrum table using a Crawler - here I have pointed a crawler at the desired table via a Glue connection to my Redshift cluster. However, this gives an S3 endpoint validation failure, which I have yet to be able to address. I’m reluctant to dig deep into the VPC configuration, given that I’m already pretty uncertain that pointing a Crawler at a Spectrum table is the right approach.
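
For reference, the ‘single column of raw json’ shape I mention in the first bullet is easy enough to express in plain pySpark by reading the objects as text rather than json. The sketch below assumes one json record per line and uses a placeholder S3 path; what I haven’t found is any crawler/classifier configuration that gets the Glue catalog into this state:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every object under the prefix as plain text: one row per line, a single
# string column, and no schema inference - i.e. the 'one varchar column of json'
# shape. The bucket/prefix here is a placeholder.
raw = spark.read.text("s3://my-bucket/dynamodb-backup/")
raw.printSchema()  # root |-- value: string (nullable = true)
```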
In short, I have found no way to dynamically ingest and parse non-scalar json on S3 using either a Glue Crawler or a combination of Glue and Redshift Spectrum. This is arguably not an esoteric task - in fact, it is something that would need to be achieved by anyone wanting a relatively automated approach to reporting on data from a dynamodb-based web app.
Question
So my question, in a single statement: Is there any way to parse non-scalar json files on S3, with inconsistent schemas, using Glue (plus, potentially, another AWS service such as Redshift Spectrum)?