I'm crawling following JSON file (it's a valid JSON) from s3 data lake. Inside there are 2 fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from one another.
{
"device": "0013374838793C8",
"timestamp": "2019-03-04T14:44:39Z",
"data": [
{ "eparke_status": "09" },
{ "eparke_x": "FFF588" },
{ "eparke_y": "000352" },
{ "eparke_z": "000ACC" },
{ "eparke_temp": "14.00" },
{ "eparke_voltage": "4.17" }
]
}
Unfortunately, when I'm crawling with AWS Glue crawler schema cannot be inferred properly and what I got in Athena is not what I expect.
Following listing shows a row of data from AWS Athena.
1 0013374838793C8 2019-03-05T13:11:41Z [{eparke_status=0B, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=FFF6D4, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=000133, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=000DA3, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=14.00, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=4.17}]
As you can see for each object inside array schema is discovered "wrongly". Each column in DB contains ALL of the array objects fields, the majority of which are just set as nulls, which is understandable because they are not found. The discovered schema is not what I'm looking for.
Expectations
Following listing shows expected form of a table row after crawling with AWS Glue.
1 0013374838793C8 2019-03-05T13:11:41Z eparke_status=0B eparke_x=FFF6D4 eparke_y=000133 eparke_z=000DA3 eparke_temp=14.00 eparke_voltage=4.17
What I have tried so far?
AWS Glue Classifiers To force schema I tried to use classifiers.
$.device $.timestamp $.eparke_status $.eparke_x $.eparke_y $.eparke_z $.eparke_temp $.eparke_voltage
and
$.device $.timestamp $.data[0].eparke_status $.data[1].eparke_x $.data[2].eparke_y $.data[3].eparke_z $.data[4].eparke_temp $.data[5].eparke_voltage
Still, final schema looks the same - all objects are packed inside each column.
Any ideas how to address this issue? I'm also trying to configure ETL job with a custom script but failed so far.