Unnesting nested JSON structures in Apache Drill

Question

I have the following JSON (roughly) and I'd like to extract the information from the header and defects fields separately:

{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890",
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}

I have tried to access the individual elements with file.header.timeStamp etc but that returns null. I have tried using flatten(file) but that gives me

Cannot cast org.apache.drill.exec.vector.complex.MapVector to org.apache.drill.exec.vector.complex.RepeatedValueVector

I've looked into kvgen() but don't see how that fits in my case. I tried kvgen(file.header) but that gets me

kvgen function only supports Simple maps as input

which is what I had expected anyway.

Does anyone know how I can get header and defects, so I can process the information contained in them. Ideally, I'd just select the information from header because it contains no arrays or maps, so I can take individual records as they are. For defects I'd simply use FLATTEN(defectParts) to obtain a table of the defective parts.

Any help would be appreciated.

adeneche adeneche · Accepted Answer · 2016-03-15T06:17:23

What version of Drill are you using ? I tried querying the following file on latest master (1.7.0-SNAPHOT):

{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890"
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}
{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890"
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}

And the following queries are working fine: 1.

select t.file.header.serialno as serialno from `parts.json` t;
+-----------+
| serialno  |
+-----------+
| 3456      |
| 3456      |
+-----------+
2 rows selected (0.098 seconds)

2.

select flatten(t.file.defects) defects from `parts.json` t;
+---------------------------------------------------------------------------------------+
|                                        defects                                        |
+---------------------------------------------------------------------------------------+
| {"info":{"systemId":"DEFCHK123","numDefects":"3","defectParts":["003","006","008"]}}  |
| {"info":{"systemId":"DEFCHK123","numDefects":"3","defectParts":["003","006","008"]}}  |
+---------------------------------------------------------------------------------------+

3.

select q.h.serialno as serialno, q.d.info.defectParts as defectParts from (select t.file.header h, flatten(t.file.defects) d from `parts.json` t) q;
+-----------+----------------------+
| serialno  |     defectParts      |
+-----------+----------------------+
| 3456      | ["003","006","008"]  |
| 3456      | ["003","006","008"]  |
+-----------+----------------------+
2 rows selected (0.126 seconds)

PS: This should've been a comment but I don't have enough rep yet!

Unnesting nested JSON structures in Apache Drill

2 Answers