
I am creating Parquet files using Pandas and pyarrow and then reading the schema of those files in Java (org.apache.parquet.avro.AvroParquetReader).

I found out that Parquet files created using pandas + pyarrow always encode arrays of primitive types as arrays of records with a single field.

I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

Here is the python script to create parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


df = pd.DataFrame(
  {
    'organizationId' : ['org1', 'org2', 'org3'],
    'entityType' : ['customer', 'customer', 'customer'],
    'entityId' : ['cust_1', 'cust_2', 'cust_3'],
    'customerProducts' : [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
  }
)

table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
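
The nested encoding is already visible from Python if I inspect the Parquet schema of the written file directly (a minimal sketch, using the same output.parquet as above):

import pyarrow.parquet as pq

# The Parquet schema shows the list column as a nested group
# (a repeated 'list' group wrapping an optional 'item' field)
# rather than as a flat repeated string.
print(pq.ParquetFile('output.parquet').schema)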

When I try to read the Avro schema of that Parquet file I see the following schema for the 'customerProducts' field:

{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}

but I would expect something like this:

{"type":"array","type":["null","string"],"default":null}]}}

Does anyone know if there is a way to make sure that Parquet files created with arrays of primitive types will have the simplest schema possible?

thanks

Hmm... after researching, it looks like this is fairly standard practice, but I am still not sure why. I found this spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/… and github.com/apache/parquet-mr/blob/master/parquet-column/src/… – anthony

1 Answer


As far as I know, the Parquet data model follows the Dremel data model, which allows a field to have one of three repetition types:

  1. required
  2. optional
  3. repeated

In order to represent a list, the nested type is needed: the additional level of indirection makes it possible to distinguish between empty lists and lists containing only null values.
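
Here is a small sketch of that distinction, using the same pandas + pyarrow setup as in the question (the file name lists.parquet is just an example):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Three values the list encoding has to keep apart:
# a null list, an empty list, and a list containing a null element.
df = pd.DataFrame({'values': [None, [], ['a', None]]})

pq.write_table(pa.Table.from_pandas(df), 'lists.parquet')

# All three cases survive the round trip thanks to the extra nesting level.
print(pq.read_table('lists.parquet').column('values').to_pylist())
# [None, [], ['a', None]]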