
I need to select elements in deeply nested data structures in Parquet files. The schema of the Parquet file is as follows:

 root
 |-- descriptor_type: string (nullable = true)
 |-- src_date: long (nullable = true)
 |-- downloaded: long (nullable = true)
 |-- exit_nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- fingerprint: string (nullable = true)
 |    |    |-- published: long (nullable = true)
 |    |    |-- last_status: long (nullable = true)
 |    |    |-- exit_adresses: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: long (valueContainsNull = true)

An entry in the dataset, serialized as JSON, looks like this:

{
  "descriptor_type": "tordnsel 1.0",
  "src_date": 1472781720000,
  "downloaded": 1472781720000,
  "exit_nodes": [
    {
      "fingerprint": "CECCFA65F3EB16CA8C0F9EAC9050C348515E26C5",
      "published": 1472713568000,
      "last_status": 1472716961000,
      "exit_adresses": {
        "178.217.187.39": 1472717419000
      }
    },
...

I'm using Spark 2.0 as integrated in SnappyData 0.6, in which the Parquet file is addressed like this:

snappy> CREATE EXTERNAL TABLE stage USING PARQUET OPTIONS (path './testdata.parquet.snappy');

Selecting the first row yields the following result:

snappy> select * from stage limit 1;
descriptor_type|src_date            |downloaded          |exit_nodes                                                                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tordnsel 1.0   |1472781720000       |1472781720000       |5704000060110000e011000060120000d812000058130000d813000058140000d014000050150000d015000050160000d016000048170000c81700004018000&

The field 'exit_nodes' is rendered as one long hex string, not as the array of structs I was naïvely hoping for.
I can select a specific element in the 'exit_nodes' array by index:

snappy> select exit_nodes[0].fingerprint, exit_nodes[0].published, exit_nodes[0].exit_adresses from stage limit 1;
EXIT_NODES[0].FINGERPRINT                                   |EXIT_NODES[0].PUBLISHED|EXIT_NODES[0].EXIT_ADRESSES                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3D28E5FBD0C670C004E59D6CFDE7305BC8948FA8                    |1472750744000          |15000000010000000800000037382e3134322e31392e3231330100000008000000b057f0e656010000  

With the 'exit_adresses' map I had no luck though:

snappy> select exit_nodes[0].exit_adresses.key from stage limit 1;
EXIT_NODES[0].EXIT_ADRESSES[KEY]
--------------------------------
NULL    
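
For a single, already-known key, Spark SQL's map indexing by literal key should work; the sketch below reuses the key from the JSON sample above and is untested against the snappy shell:

snappy> select exit_nodes[0].exit_adresses['178.217.187.39'] from stage limit 1;

That doesn't enumerate the map's keys and values, though.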

So my questions are:
- How do I select the keys and values of one 'exit_adresses' map?
- How do I select all records in the 'exit_nodes' array, or all key-value pairs in the nested maps, so that I can import them from the Parquet file into an RDBMS?


1 Answer


I don't have a direct answer to this, but IMHO there is no further support for querying nested Parquet types beyond this Spark PR-360.

This test suite pretty much covers everything you can do: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
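
That said, Spark SQL itself does accept HiveQL-style LATERAL VIEW explode for both arrays and maps, which in principle flattens every array element and every map entry into its own row. A minimal sketch, assuming the snappy shell passes the statement through to Spark SQL unchanged (untested; the aliases n, node, addrs, addr_key and addr_value are mine):

snappy> SELECT node.fingerprint, node.published, addr_key, addr_value
FROM stage
LATERAL VIEW explode(exit_nodes) n AS node
LATERAL VIEW explode(node.exit_adresses) addrs AS addr_key, addr_value;

If that goes through, the flattened result is plain rows and columns, which should make the export to an RDBMS straightforward.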