The Problem
I'm having an issue writing a struct to Parquet using pyarrow. The failures appear to be intermittent and depend on the size of the dataset: if I sub- or super-sample the dataset, the write sometimes produces a valid file and sometimes doesn't, and I cannot discern any pattern to it.
I am writing a single column with the schema
struct<creation_date: string,
expiration_date: string,
last_updated: string,
name_server: string,
registrar: string,
status: string>
This doesn't appear to be a versioning issue - the write succeeds sometimes, and I've been able to successfully write even more complex data types like lists of structs.
If I unnest the struct so each property gets its own column, things work fine - it's something with how structs are written.
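To be concrete about what "unnest" means here: the struct gets flattened into one column per key before handing the frame to pyarrow, roughly like this (df is the same frame built in the repro further down; the output path is just illustrative):
flat = pd.DataFrame([d if d is not None else {} for d in df['whois']])
flat_table = pa.Table.from_pandas(flat, preserve_index=False)
pq.write_table(flat_table, '/tmp/tst_flat.pa')  # this file reads back fine at every size I tried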
After writing to disk, when I inspect the file with parquet-tools, I get the error org.apache.parquet.io.ParquetDecodingException: Can not read value at {n} in block 0 in file
where n is whatever row is throwing the issue. There is nothing special about that particular row.
When I load the table into hive and try to explore it there, I get something slightly more illuminating:
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at parquet.Preconditions.checkArgument(Preconditions.java:55)
at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at parquet.column.values.dictionary.DictionaryValuesReader.readValueDictionaryId(DictionaryValuesReader.java:76)
at parquet.column.impl.ColumnReaderImpl$1.read(ColumnReaderImpl.java:166)
at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
... 35 more
Oddly, other data types look fine - there's something about this specific struct that is throwing errors. Here is the code needed to reproduce the issue:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sys
# Command line argument controls the dataset size (the base rows below are repeated n times)
_, n = sys.argv
n = int(n)
# Random whois data - should be a struct with the schema
# struct<creation_date: string,
# expiration_date: string,
# last_updated: string,
# name_server: string,
# registrar: string,
# status: string>
# nothing terribly interesting
df = pd.DataFrame({'whois':[
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
{'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
None
]})
# Strangely, the bug only pops up for datasets of certain lengths:
# when n is 2 or 5 it works fine, but 3 is busted.
df = pd.concat([df for _ in range(n)]).sample(frac=1)
print(df.tail())
table = pa.Table.from_pandas(df, preserve_index=False)
print(table)
# The write doesn't throw any errors
pq.write_table(table, '/tmp/tst2.pa')
# This read is the bit that throws the error - it's some random OSError
df = pd.read_parquet('/tmp/tst2.pa')
print(df)
Updates
- I've tried altering the number of items in the struct (e.g. only keeping the first 2 children, sketched below), and that changes when the write fails, but it still fails intermittently for some sizes of data.
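For example, one of the trimmed-down variants looked roughly like this, applied to the repro's df before the table conversion (which two keys I kept is arbitrary):
keep = ('creation_date', 'expiration_date')
df['whois'] = df['whois'].apply(
    lambda d: {k: d.get(k) for k in keep} if d is not None else None)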
Things I've tried
- Upgrading the Parquet format version to 2.0
- Disabling dictionary writes
- Changing the compression settings
- Changing some data page settings
- Using an explicitly defined schema instead of an inferred one (these variations are sketched just after this list)
- Unnesting the struct (it works in this example, but not in my use case)
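Roughly, those write and schema variations looked like the following (the option values here are illustrative, not the exact combinations I tried):
schema = pa.schema([
    pa.field('whois', pa.struct([
        pa.field('creation_date', pa.string()),
        pa.field('expiration_date', pa.string()),
        pa.field('last_updated', pa.string()),
        pa.field('name_server', pa.string()),
        pa.field('registrar', pa.string()),
        pa.field('status', pa.string()),
    ]))
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa',
               version='2.0',             # parquet format version
               use_dictionary=False,      # disable dictionary encoding
               compression='gzip',        # different compression codec
               data_page_size=64 * 1024)  # tweak data page size
None of these combinations made the failures go away.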
Environment
pyarrow==0.17.1
python==3.6.10
pandas==1.0.5
Questions
- Is this a bug, a version mismatch, or something else?
- If the issue is on my end, how should I fix it?
- If this is a bug, who should I report it to? The Arrow devs? The Parquet devs? Someone else?