
I'm crawling the following JSON file (it's valid JSON) from an S3 data lake. It contains two fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from the others.

{
  "device": "0013374838793C8",
  "timestamp": "2019-03-04T14:44:39Z",
  "data": [
    { "eparke_status": "09" },
    { "eparke_x": "FFF588" },
    { "eparke_y": "000352" },
    { "eparke_z": "000ACC" },
    { "eparke_temp": "14.00" },
    { "eparke_voltage": "4.17" }
  ]
}

Unfortunately, when I crawl it with an AWS Glue crawler, the schema is not inferred properly, and what I get in Athena is not what I expect.

The following listing shows a row of data from AWS Athena.

0013374838793C8 2019-03-05T13:11:41Z    [{eparke_status=0B, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=FFF6D4, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=000133, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=000DA3, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=14.00, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=4.17}]

As you can see, the schema of the objects inside the array is discovered incorrectly. Each array element is given ALL of the fields found across the array's objects, most of which are set to null, which is understandable because they are not present in that particular object. The discovered schema is not what I'm looking for.
Expectations

The following listing shows the expected form of a table row after crawling with AWS Glue.

0013374838793C8 2019-03-05T13:11:41Z    eparke_status=0B eparke_x=FFF6D4 eparke_y=000133 eparke_z=000DA3 eparke_temp=14.00 eparke_voltage=4.17

What I have tried so far

AWS Glue Classifiers

To force the schema I tried to use classifiers with the following JSON paths:

$.device $.timestamp $.eparke_status $.eparke_x $.eparke_y $.eparke_z $.eparke_temp $.eparke_voltage

and

$.device $.timestamp $.data[0].eparke_status $.data[1].eparke_x $.data[2].eparke_y $.data[3].eparke_z $.data[4].eparke_temp $.data[5].eparke_voltage

Still, the final schema looks the same: all of the object fields are packed into each column.

Any ideas on how to address this issue? I have also tried to configure an ETL job with a custom script, but have failed so far.


1 Answer


One thing I have noticed is that once a crawler has run, the initially inferred schema and the selected classifiers tend not to change on subsequent runs. I think it is safer to duplicate the crawler and delete any previously created tables while experimenting.

I am not sure you can concatenate multiple root expressions in the JSON classifier expression. The documentation says that for a JSON classifier you just need to provide THE path to the node in each record that will be treated as the actual JSON to infer the schema from.

To use each element of the array to infer the schema, you would have to use $.data[*]. But that would mean you would miss the device and timestamp.

You cannot do this through a crawler alone. My recommendation is to crawl with no custom classifier and then UNNEST the data out of the array structure with an Athena query (https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html). Load the result into some datastore if needed; for writing the result back to S3, look at CTAS (CREATE TABLE ... AS SELECT) as an option. You might also be able to configure this as an ETL job.
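As a rough sketch (my_database and my_crawled_table are placeholders for whatever your crawler actually created), a flattening query could look something like the following. Since each element of the data array carries exactly one non-null field, taking MAX() per device/timestamp group collapses the unnested rows back into a single row:

SELECT
  device,
  "timestamp",
  MAX(d.eparke_status)  AS eparke_status,
  MAX(d.eparke_x)       AS eparke_x,
  MAX(d.eparke_y)       AS eparke_y,
  MAX(d.eparke_z)       AS eparke_z,
  MAX(d.eparke_temp)    AS eparke_temp,
  MAX(d.eparke_voltage) AS eparke_voltage
FROM my_database.my_crawled_table        -- placeholder names, adjust to your catalog
CROSS JOIN UNNEST(data) AS t(d)          -- one row per element of the "data" array
GROUP BY device, "timestamp"

If the flattened result needs to land back in S3, the same SELECT can be wrapped in a CTAS statement.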