
Objective

I have an S3 folder full of JSON files with varying schemas, including arrays (a DynamoDB backup, as it happens). While the schemas vary, all files contain some common elements, such as 'id' or 'name', as well as nested arrays of varying lengths, such as 'selected items'. I want to be able to parse out these elements at my leisure.

I have a working approach using an external ETL tool (KNIME), which I'm looking to replicate in a serverless manner via Glue.

Background

The working approach is:

  1. Load all the S3 data as an external table, using Spectrum, with each JSON record as a single varchar(65535) entry.
  2. Parse out the elements I need, including specific arrays, using Redshift SQL functions such as json_extract_path_text and json_array_length.
  3. Normalise required JSON arrays by cross joining against a reference table of array indexes.
  4. Perform the required table join and write out to Redshift for consumption by Tableau.

Now, this seems like a task that should be a fit for Glue. Namely, what I want to do is either:

  1. Load all data using Spectrum, as above

  2. In Glue, create a dynamic frame from the Spectrum table

  3. Parse data from the dynamic frame as above, using functions such as pyspark.sql.functions.explode() or maybe using Glue's Relationalize transform (see the sketch after this list)

Or:

  1. Crawl all scalar data into a single Glue schema (given that Glue doesn't yet support JSON arrays)
  2. Parse JSON and explode arrays using one of the approaches above
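
Roughly, the kind of parsing step I have in mind, as a sketch only - the database, table and staging path names below are placeholders, and I'm assuming the catalog table exposes 'id', 'name' and 'selected items':

    from awsglue.context import GlueContext
    from awsglue.transforms import Relationalize
    from pyspark.context import SparkContext
    from pyspark.sql.functions import explode, col

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Placeholder catalog database and table names
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db", table_name="dynamo_backup")

    # Option A: let Glue flatten nested structures and arrays into linked tables
    flattened = Relationalize.apply(
        frame=dyf,
        staging_path="s3://my-bucket/tmp/",  # placeholder scratch location
        name="root")
    # flattened.keys() lists the resulting flat tables

    # Option B: drop to a Spark DataFrame and explode the array by hand
    df = dyf.toDF()
    df.select("id", "name", explode(col("selected items")).alias("selected_item")).show()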

Results so far

Unfortunately I haven’t been able to get either of these approaches to work. For the various approaches, the blockers have been:

  1. Crawl the JSON data using a Glue crawler - as per this post, Glue’s parser heuristics decide that the various schemas of the source files are too different to belong to a single table, and so parse them out as a bunch of different tables. It would be enough to have a crawler that simply crawled each file to produce a table with a single column of type varchar(65535) containing a JSON entry per row, but there doesn’t seem to be any Glue classifier JSON path expression that will achieve this.
  2. The ‘single varchar(65535) column’ approach can be achieved by loading the data as an external table in Spectrum; however, it seems that Spectrum tables can’t be loaded into Glue as dynamic frames (note that the relevant table exists in the Glue catalog, where it is shown as having the expected varchar(65535) schema). Working in Zeppelin notebooks, I’ve found that

    newFrame = glueContext.create_dynamic_frame.from_catalog(database="<spectrum database>", table_name="<spectrum one column varchar table>")

    runs successfully, but produces a frame with newFrame.count() = 0, and newFrame.toDF().show(n), for any value of n, produces an odd output of the form:

    ++
    ||
    ++
    ++

    In short, it appears that pySpark can’t work directly with Spectrum tables through Glue.

  3. Crawl a Spectrum table using a crawler. Here I have pointed a crawler at the desired table via a Glue connection to my Redshift cluster. However, this gives an S3 endpoint validation failure, which I have yet to be able to address. I’m reluctant to dig deep into the VPC configuration, given that I’m already pretty uncertain that pointing a crawler at a Spectrum table is going to be the right approach.

In short, I have found no way to dynamically ingest and parse non-scalar JSON on S3 using either a Glue crawler or a combination of Glue and Redshift Spectrum. This is arguably not an esoteric task - in fact, it is something that would need to be achieved by anyone wanting a relatively automated approach to reporting on data from a DynamoDB-based web app.

Question

So my question, in a single statement: Is there any way to parse non-scalar JSON files on S3, with inconsistent schemas, using Glue (plus, potentially, another AWS service such as Redshift Spectrum)?


1 Answer


So there are a few things that I assume are going on in the background.

I assume you defined an external table in Redshift Spectrum that points to S3? If so, that's not the best approach. Instead, define an external schema that points to a database in the Glue data catalog. The effect is that RS Spectrum will see everything in that Glue database, and you don't need to define separate tables.

Secondly, have you tried defining a table manually in the Glue data catalog? I have done extensive testing with the Parquet file format, messing with the table definitions and file contents. The result is that whatever queries that table will only get back the data defined in the table definition. So you can define a table that has 'id', 'name', and 'selected items' fields, and everything else will be ignored.
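
If you go down the manual route, something along these lines via boto3 is one way to do it - purely a sketch, with the database, bucket and SerDe choices as placeholders, and the array's element type guessed at:

    import boto3

    glue = boto3.client("glue")

    # Sketch only: assumes a catalog database "my_db" already exists and the
    # JSON files live under s3://my-input-bucket/dynamo-backup/
    glue.create_table(
        DatabaseName="my_db",
        TableInput={
            "Name": "dynamo_backup",
            "Parameters": {"classification": "json"},
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "id", "Type": "string"},
                    {"Name": "name", "Type": "string"},
                    # Guessing the element type; the real key also contains a space
                    # ('selected items'), which would need handling in the SerDe
                    {"Name": "selected_items", "Type": "array<string>"},
                ],
                "Location": "s3://my-input-bucket/dynamo-backup/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                },
            },
        },
    )

Any JSON key not listed in Columns is simply ignored when the table is queried.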

If for any reason the previous does not seem to work, then a Glue Job will help. As a note here - always use just spark, never use anything glue_context related. Anyways, in spark.read.blaah you can specify a schema parameter. Use it.

    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    dataSchema = StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("selected items", ArrayType(.. etc ...), True)
    ])

That way you'll get the same result - it will only parse out the data that you want.
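
Put together, that read step might look roughly like this - a sketch only, with a placeholder S3 path and the array's element type assumed to be plain strings:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    # Assumed element type for the array - swap in the real structure
    dataSchema = StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("selected items", ArrayType(StringType()), True)
    ])

    # Only the fields named in the schema get parsed; everything else is ignored
    df = spark.read.schema(dataSchema).json("s3://my-input-bucket/dynamo-backup/")

    # One row per array element, if you want the arrays normalised
    df.select("id", "name", explode(col("selected items")).alias("selected_item")).show()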

The setup in the end should probably be something like this:

  • S3 input bucket has JSON files
  • Glue job reads data from the input bucket and writes to a different bucket, possibly partitioned and/or in a different data format (a rough sketch of such a job follows below)
  • Glue table is defined on top of the second bucket/prefix
  • Redshift Spectrum is pointed at a database that contains the table defined in the previous step
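
That middle Glue job step, written with plain spark as suggested above, could be as simple as this sketch (bucket names are placeholders and the schema is the same assumed one as in the earlier snippet):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    # Same assumed schema as above - only these fields are kept
    dataSchema = StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("selected items", ArrayType(StringType()), True)
    ])

    raw = (spark.read.schema(dataSchema).json("s3://my-input-bucket/dynamo-backup/")
           .withColumnRenamed("selected items", "selected_items"))  # Parquet won't accept spaces in names

    # Write to the second bucket as Parquet; the Glue table in the next step
    # is then defined on top of this prefix
    raw.write.mode("overwrite").parquet("s3://my-output-bucket/dynamo-backup-parquet/")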