
I have created a parquet file with a decimal column type pa.decimal128(12, 4) using pyarrow. After I read the file and access its metadata I get the following output:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7f4752644310>
  file_offset: 26077
  file_path: 
  physical_type: FIXED_LEN_BYTE_ARRAY
  num_values: 3061
  path_in_schema: Price
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f4752644360>
      has_min_max: True
      min: b'\x00\x00\x00\x00\x9b\xdc'
      max: b'\x00\x00w5\x93\x9c'
      null_count: 0
      distinct_count: 0
      num_values: 3061
      physical_type: FIXED_LEN_BYTE_ARRAY
      logical_type: Decimal(precision=12, scale=4)
      converted_type (legacy): DECIMAL
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 22555
  data_page_offset: 23225
  total_compressed_size: 3522
  total_uncompressed_size: 3980

As you can see the min/max values are actually byte objects. How would I decode these to actual decimal values?

I tried casting it with

pc.cast(statistics.max, pa.decimal128(12, 4))

but got the following error message instead

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from binary to decimal using function cast_decimal

For decimal128 you'd expect to have 16 bytes, but here you only have 6, so they've likely been truncated. In that case try int.from_bytes(b'\x00\x00\x00\x00\x9b\xdc', 'big') / 10_000. - 0x26res
BTW, there is an open feature request for decoding these to a better type. - Micah Kornfield

1 Answer


The statistics are based on the physical type, not the logical type. For Decimal(precision=12, scale=4) the physical type is FIXED_LEN_BYTE_ARRAY, which is why the min and max come back as byte strings. Unfortunately, in order to convert back to a decimal, you'll need to know how Arrow encodes a decimal into a fixed-length byte array.

Arrow first determines how many bytes are needed based on the precision; you won't need to reverse engineer this part. It then converts the value to big endian, truncates it down to the needed bytes, and writes them. Reversing those steps lets you convert back:

import pyarrow as pa

dtype = pa.decimal128(12, 4)

def pad(b):
  # Sign-extend to 16 bytes (two's complement): pad with 0x00 for
  # non-negative values, 0xff for negative ones. The input here is
  # little endian, so the sign byte is the last one and the padding
  # goes on the right.
  if b[-1] & 0x80 == 0:
    return b.ljust(16, b'\x00')
  else:
    return b.ljust(16, b'\xff')

def to_pyarrow_bytes(b):
  # Converts from big endian (Parquet's repr) to little endian
  # (Arrow's repr) and then sign-extends to 16 bytes.
  return pad(b[::-1])

def decode_stats_decimal(b):
  pyarrow_bytes = to_pyarrow_bytes(b)
  arr = pa.Array.from_buffers(dtype, 1, [None, pa.py_buffer(pyarrow_bytes)], 0)
  return arr[0].as_py()

decode_stats_decimal(statistics.max)
# Decimal('199999.9900')
decode_stats_decimal(statistics.min)
# Decimal('3.9900')
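As the comment on the question suggests, you can also skip Arrow entirely: the statistics bytes are just a big-endian two's-complement unscaled integer, so the standard library's decimal module is enough. A minimal sketch (the example bytes and scale are taken from the question's column):

```python
from decimal import Decimal

def decode_decimal_bytes(b, scale):
    # Parquet stores DECIMAL stats as a big-endian two's-complement
    # integer; shift by the column's scale to recover the value.
    unscaled = int.from_bytes(b, byteorder='big', signed=True)
    return Decimal(unscaled).scaleb(-scale)

decode_decimal_bytes(b'\x00\x00w5\x93\x9c', 4)
# Decimal('199999.9900')
decode_decimal_bytes(b'\x00\x00\x00\x00\x9b\xdc', 4)
# Decimal('3.9900')
```

`signed=True` matters for negative decimals, which have their sign bit set in the leading byte.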