
The Problem

I am having an issue writing a struct to parquet using pyarrow. There appear to be intermittent failures based on the size of the dataset. If I sub- or super-sample the dataset, it will sometimes write a valid dataset, sometimes not. I cannot discern any pattern to it.

I am writing a single column, with the schema

struct<creation_date: string, 
     expiration_date: string, 
     last_updated: string, 
     name_server: string, 
     registrar: string, 
     status: string>

This doesn't appear to be a versioning issue - the write succeeds sometimes, and I've been able to successfully write even more complex data types like lists of structs.

If I unnest the struct so each property gets its own column, things work fine - it's something with how structs are written.
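
By "unnesting" I mean promoting each struct field to its own top-level column before writing, roughly like this (a sketch; df and the imports are the same as in the repro script further down, and the output path is just illustrative):

table = pa.Table.from_pandas(df, preserve_index=False)
flat = table.flatten()  # whois.creation_date, whois.registrar, ... become top-level columns
pq.write_table(flat, '/tmp/tst_flat.pa')  # this flattened version reads back fine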

After writing to disk, when I inspect with parquet-tools, I get the error org.apache.parquet.io.ParquetDecodingException: Can not read value at {n} in block 0 in file where n is whatever row is throwing the issue. There is nothing special about that particular row.

When I load the table into hive and try to explore it there, I get something slightly more illuminating:

Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
    at parquet.Preconditions.checkArgument(Preconditions.java:55)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
    at parquet.column.values.dictionary.DictionaryValuesReader.readValueDictionaryId(DictionaryValuesReader.java:76)
    at parquet.column.impl.ColumnReaderImpl$1.read(ColumnReaderImpl.java:166)
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
    ... 35 more
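
For anyone reproducing this: the file's metadata can also be inspected from Python rather than parquet-tools (a quick sketch using pyarrow's ParquetFile API):

import pyarrow.parquet as pq
pf = pq.ParquetFile('/tmp/tst2.pa')
print(pf.metadata)  # row groups, num_rows, created_by
print(pf.schema)    # the physical parquet schema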

Oddly, other data types look fine - there's something about this specific struct that is throwing errors. Here is the code needed to reproduce the issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sys
# Command line argument to set how many rows in the dataset
_, n = sys.argv
n = int(n)

# Random whois data - should be a struct with the schema
# struct<creation_date: string, 
#     expiration_date: string, 
#     last_updated: string, 
#     name_server: string, 
#     registrar: string, 
#     status: string>
# nothing terribly interesting

df = pd.DataFrame({'whois':[
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
{'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
None
]})

# strangely, the bug only pops up for datasets of certain length
# When n is 2 or 5 it works fine, but 3 is busted.  
df = pd.concat([df for _ in range(n)]).sample(frac=1)
print(df.tail())
table = pa.Table.from_pandas(df, preserve_index=False)
print(table)
# The write doesn't throw any errors
pq.write_table(table, '/tmp/tst2.pa')
# This read is the bit that throws the error - it's some random OSError
df = pd.read_parquet('/tmp/tst2.pa')
print(df)
Updates

  • I've tried altering the number of items in the struct (e.g. only keeping the first 2 children). That changes which sizes fail, but the write still fails intermittently for some sizes of data.

Things I've tried

  • Upgrading the parquet format version to 2.0
  • Disabling dictionary writes
  • Changing the compression settings
  • Changing some data page settings (these four are all write_table options; see the sketch after this list)
  • Using an explicitly defined schema instead of an inferred one
  • Unnesting the struct (it works in this example, but not in my use case)
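
For concreteness, the write-option variations above were along these lines (a sketch of what I tried, not my exact code - all of these are standard pyarrow.parquet.write_table keyword arguments):

pq.write_table(
    table, '/tmp/tst2.pa',
    version='2.0',             # parquet format version (default is '1.0')
    use_dictionary=False,      # disable dictionary encoding
    compression='snappy',      # tried a few codecs here
    data_page_size=64 * 1024,  # tweak the data page size
)

None of these combinations made the failure go away.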

Environment

  • pyarrow==0.17.1
  • python==3.6.10
  • pandas==1.0.5

Questions

  • Is this a bug, a version mismatch, or something else?
  • If the issue is on my end, how should I fix it?
  • If this is a bug, who should I report it to? The arrow devs? The parquet devs? Someone else?

1 Answer


Your table schema has a nested struct: it's basically one column called whois containing a user-defined type with fields creation_date, expiration_date, etc.

> table.schema
whois: struct<creation_date: string, expiration_date: string, last_updated: null, name_server: string, registrar: string, status: string>
  child 0, creation_date: string
  child 1, expiration_date: string
  child 2, last_updated: null
  child 3, name_server: string
  child 4, registrar: string
  child 5, status: string

Prior to 0.17.0, reading and writing nested UDTs (user-defined types) to parquet was not supported, but this has since been addressed: https://issues.apache.org/jira/browse/ARROW-1644
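
If in doubt, it's worth double-checking which version is actually being imported at runtime:

import pyarrow as pa
print(pa.__version__)  # nested struct read/write support landed in 0.17.0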

If you're stuck on an older version of arrow then, since you only have one column in your data frame, I'd recommend not using a UDT at all:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {}
])

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')

Another option is to flatten the Arrow table after converting from pandas:

df = pd.DataFrame({'whois':[
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
{'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
None
]})
table = pa.Table.from_pandas(df, preserve_index=False).flatten()
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')

As a side note, you may want to provide your own schema: pandas and arrow try to guess the column types, but they don't do a good job with all-null columns (last_updated ends up inferred as double in the flattened case, or null in the struct case).

> table.schema
creation_date: string
expiration_date: string
last_updated: double
name_server: string
registrar: string
status: string

So instead you could do something like:

df = pd.DataFrame([
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {}
])

table_schema = pa.schema([
    pa.field('creation_date', pa.string()),
    pa.field('expiration_date', pa.string()),
    pa.field('last_updated', pa.string()),
    pa.field('name_server', pa.string()),
    pa.field('registrar', pa.string()),
    pa.field('status', pa.string()),
])

# pass the schema explicitly so from_pandas doesn't infer last_updated as double
table = pa.Table.from_pandas(df, schema=table_schema, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')
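
If you do want to keep the nested struct column, the same idea applies: spell out the struct type yourself and hand it to from_pandas. A sketch (pa.struct and the schema argument to from_pandas are standard pyarrow API, though I haven't run this against 0.17.1 specifically):

whois_type = pa.struct([
    pa.field('creation_date', pa.string()),
    pa.field('expiration_date', pa.string()),
    pa.field('last_updated', pa.string()),
    pa.field('name_server', pa.string()),
    pa.field('registrar', pa.string()),
    pa.field('status', pa.string()),
])
nested_schema = pa.schema([pa.field('whois', whois_type)])

# df here is the original single-column frame of dicts from the question
table = pa.Table.from_pandas(df, schema=nested_schema, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')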