I am creating Parquet files using pandas and pyarrow and then reading the schema of those files in Java (org.apache.parquet.avro.AvroParquetReader).
I found out that Parquet files created with pandas + pyarrow always encode arrays of primitive types as an array of records with a single field.
I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery
Here is the Python script that creates the Parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(
    {
        'organizationId': ['org1', 'org2', 'org3'],
        'entityType': ['customer', 'customer', 'customer'],
        'entityId': ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts': [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
    }
)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
When I try to read Avro schema of that parquet file I see the following schema for 'customerProducts' field:
{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}
but I would expect something like this:
{"type":"array","items":["null","string"]}
Does anyone know if there is a way to make sure that Parquet files with arrays of primitive types are created with the simplest possible schema?
Thanks