
The Problem

I am having an issue writing a struct to parquet using pyarrow. There appear to be intermittent failures based on the size of the dataset. If I sub- or super-sample the dataset, it will sometimes write a valid dataset, sometimes not. I cannot discern any pattern to it.

I am writing a single column, with the schema

struct<creation_date: string, 
     expiration_date: string, 
     last_updated: string, 
     name_server: string, 
     registrar: string, 
     status: string>

This doesn't appear to be a versioning issue - the write succeeds sometimes, and I've been able to successfully write even more complex data types like lists of structs.

If I unnest the struct so each property gets its own column, things work fine - it's something with how structs are written.
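
By "unnesting" I mean promoting each struct field to its own top-level column before writing, roughly like this (a sketch; df and the imports are the same as in the repro script further down, and the output path is just illustrative):

table = pa.Table.from_pandas(df, preserve_index=False)
flat = table.flatten()  # whois.creation_date, whois.registrar, ... become top-level columns
pq.write_table(flat, '/tmp/tst_flat.pa')  # this flattened version reads back fine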

After writing to disk, when I inspect with parquet-tools, I get the error org.apache.parquet.io.ParquetDecodingException: Can not read value at {n} in block 0 in file where n is whatever row is throwing the issue. There is nothing special about that particular row.

When I load the table into hive and try to explore it there, I get something slightly more illuminating:

Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
    at parquet.Preconditions.checkArgument(Preconditions.java:55)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
    at parquet.column.values.dictionary.DictionaryValuesReader.readValueDictionaryId(DictionaryValuesReader.java:76)
    at parquet.column.impl.ColumnReaderImpl$1.read(ColumnReaderImpl.java:166)
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
    ... 35 more
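
For anyone reproducing this: the file's metadata can also be inspected from Python rather than parquet-tools (a quick sketch using pyarrow's ParquetFile API):

import pyarrow.parquet as pq
pf = pq.ParquetFile('/tmp/tst2.pa')
print(pf.metadata)  # row groups, num_rows, created_by
print(pf.schema)    # the physical parquet schema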

Oddly, other data types look fine - there's something about this specific struct that is throwing errors. Here is the code needed to reproduce the issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sys
# Command line argument to set how many rows in the dataset
_, n = sys.argv
n = int(n)

# Random whois data - should be a struct with the schema
# struct<creation_date: string, 
#     expiration_date: string, 
#     last_updated: string, 
#     name_server: string, 
#     registrar: string, 
#     status: string>
# nothing terribly interesting

df = pd.DataFrame({'whois':[
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
{'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
None
]})

# strangely, the bug only pops up for datasets of certain length
# When n is 2 or 5 it works fine, but 3 is busted.  
df = pd.concat([df for _ in range(n)]).sample(frac=1)
print(df.tail())
table = pa.Table.from_pandas(df, preserve_index=False)
print(table)
# The write doesn't throw any errors
pq.write_table(table, '/tmp/tst2.pa')
# This read is the bit that throws the error - it's some random OSError
df = pd.read_parquet('/tmp/tst2.pa')
print(df)
Updates

  • I've tried altering the number of items in the struct (e.g. only keeping the first 2 children). That changes which sizes fail, but the write still fails intermittently for some sizes of data.

Things I've tried

  • Upgrading the parquet format version to 2.0
  • Disabling dictionary writes
  • Changing the compression settings
  • Changing some data page settings (these four are all write_table options; see the sketch after this list)
  • Using an explicitly defined schema instead of an inferred one
  • Unnesting the struct (it works in this example, but not in my use case)
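
For concreteness, the write-option variations above were along these lines (a sketch of what I tried, not my exact code - all of these are standard pyarrow.parquet.write_table keyword arguments):

pq.write_table(
    table, '/tmp/tst2.pa',
    version='2.0',             # parquet format version (default is '1.0')
    use_dictionary=False,      # disable dictionary encoding
    compression='snappy',      # tried a few codecs here
    data_page_size=64 * 1024,  # tweak the data page size
)

None of these combinations made the failure go away.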

Environment

  • pyarrow==0.17.1
  • python==3.6.10
  • pandas==1.0.5

Questions

  • Is this a bug, a version mismatch, or something else?
  • If the issue is on my end, how should I fix it?
  • If this is a bug, who should I report it to? The arrow devs? The parquet devs? Someone else?

1 Answer


Your table schema has a nested struct: it's basically one column called whois containing a user-defined type with fields creation_date, expiration_date, etc.

> table.schema
whois: struct<creation_date: string, expiration_date: string, last_updated: null, name_server: string, registrar: string, status: string>
  child 0, creation_date: string
  child 1, expiration_date: string
  child 2, last_updated: null
  child 3, name_server: string
  child 4, registrar: string
  child 5, status: string

Prior to 0.17.0, reading and writing nested UDTs (user-defined types) to parquet was not supported, but this has since been addressed: https://issues.apache.org/jira/browse/ARROW-1644
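
If in doubt, it's worth double-checking which version is actually being imported at runtime:

import pyarrow as pa
print(pa.__version__)  # nested struct read/write support landed in 0.17.0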

If you're stuck on an older version of arrow then, since you only have one column in your data frame, I'd recommend not using a UDT at all:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {}
])

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')

Another option is to flatten the Arrow table after converting from pandas:

df = pd.DataFrame({'whois':[
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
{'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
{'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
None
]})
table = pa.Table.from_pandas(df, preserve_index=False).flatten()
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')

As a side note, you may want to provide your own schema: pandas and arrow try to guess the column types, but they don't do a good job with all-null columns (last_updated ends up inferred as double in the flattened case, or null in the struct case).

> table.schema
creation_date: string
expiration_date: string
last_updated: double
name_server: string
registrar: string
status: string

So instead you could do something like:

df = pd.DataFrame([
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T16:10:35', 'expiration_date': '2022-07-17T16:10:35', 'last_updated': None, 'name_server': 'ns59.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {'registrar': 'Hongkong Domain Name Information Management Co., Limited', 'creation_date': '2020-07-17T10:28:36', 'expiration_date': '2021-07-17T10:28:36', 'last_updated': None, 'name_server': 'ns2.alidns.com\r', 'status': 'ok'},
    {'registrar': 'GoDaddy.com, LLC', 'creation_date': '2020-07-17T04:04:06', 'expiration_date': '2021-07-17T04:04:06', 'last_updated': None, 'name_server': 'ns76.domaincontrol.com\r', 'status': 'clientDeleteProhibited'},
    {}
])

table_schema = pa.schema([
    pa.field('creation_date', pa.string()),
    pa.field('expiration_date', pa.string()),
    pa.field('last_updated', pa.string()),
    pa.field('name_server', pa.string()),
    pa.field('registrar', pa.string()),
    pa.field('status', pa.string()),
])

# pass the schema explicitly so from_pandas doesn't infer last_updated as double
table = pa.Table.from_pandas(df, schema=table_schema, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')
df = pd.read_parquet('/tmp/tst2.pa')
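
If you do want to keep the nested struct column, the same idea applies: spell out the struct type yourself and hand it to from_pandas. A sketch (pa.struct and the schema argument to from_pandas are standard pyarrow API, though I haven't run this against 0.17.1 specifically):

whois_type = pa.struct([
    pa.field('creation_date', pa.string()),
    pa.field('expiration_date', pa.string()),
    pa.field('last_updated', pa.string()),
    pa.field('name_server', pa.string()),
    pa.field('registrar', pa.string()),
    pa.field('status', pa.string()),
])
nested_schema = pa.schema([pa.field('whois', whois_type)])

# df here is the original single-column frame of dicts from the question
table = pa.Table.from_pandas(df, schema=nested_schema, preserve_index=False)
pq.write_table(table, '/tmp/tst2.pa')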