3
votes

I am trying to read an Avro file using the Python avro library (Python 2). When I use the following code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter, BinaryDecoder
reader = DataFileReader(open("filename.avro", "rb"), DatumReader())
schema = reader.meta

Then it reads every column correctly, except for one, which remains as raw bytes rather than the expected decimal values.

How can I convert this column to the expected decimal values? I notice that the file's metadata identifies the column as 'type': 'bytes' with 'logicalType': 'decimal'.

I post below the metadata for this column, as well as the byte values (the expected actual values are all multiples of 1,000, less than 25,000). The file was created using Kafka.

Metadata:

 {
                            "name": "amount",
                            "type": {
                                "type": "bytes",
                                "scale": 8,
                                "precision": 20,
                                "connect.version": 1,
                                "connect.parameters": {
                                    "scale": "8",
                                    "connect.decimal.precision": "20"
                                },
                                "connect.name": "org.apache.kafka.connect.data.Decimal",
                                "logicalType": "decimal"
                            }
                        }

Byte values:

'E\xd9d\xb8\x00'
'\x00\xe8\xd4\xa5\x10\x00'
'\x01\x17e\x92\xe0\x00'
'\x01\x17e\x92\xe0\x00'

Expected values:

3,000.00
10,000.00
12,000.00
5,000.00
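(For reference: per the Avro spec, a decimal is stored as the big-endian two's-complement bytes of the unscaled integer, so the second byte string above can be checked by hand in Python 3:)

```python
# Per the Avro spec, 'decimal' stores the unscaled integer as
# big-endian two's-complement bytes; divide by 10**scale to recover it.
raw = b'\x00\xe8\xd4\xa5\x10\x00'   # second byte string above
unscaled = int.from_bytes(raw, byteorder='big', signed=True)
print(unscaled)          # 1000000000000
print(unscaled / 10**8)  # 10000.0 (scale = 8 from the metadata)
```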

I need to use this within a Lambda function deployed on AWS, so I cannot use fastavro or other libraries that rely on C rather than pure Python.

See links below:

https://pypi.org/project/fastavro/
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html


3 Answers

3
votes

To do this you will need to use the fastavro library. Neither the avro nor the avro-python3 library supports logical types at the time of posting.

1
vote

You can use this to decode the byte string into a decimal. It pads (sign-extending) the value up to the next largest struct size, so all possible values fit, including negatives.

import struct
from decimal import Decimal

def decode_decimal(value, num_places):
    value_size = len(value)
    # Find the smallest signed big-endian format the value fits into.
    for fmt in ('>b', '>h', '>l', '>q'):
        fmt_size = struct.calcsize(fmt)
        if fmt_size >= value_size:
            # Sign-extend: pad with 0xFF when the top bit is set so
            # negative two's-complement values survive the padding.
            pad_byte = b'\xff' if ord(value[:1]) > 0x7f else b'\x00'
            padding = pad_byte * (fmt_size - value_size)
            int_value = struct.unpack(fmt, padding + value)[0]
            scale = Decimal('1') / (10 ** num_places)
            return Decimal(int_value) * scale
    raise ValueError('Could not unpack value')

Ex:

>>> decode_decimal(b'\x00\xe8\xd4\xa5\x10\x00', 8)
Decimal('10000.00000000')
>>> decode_decimal(b'\x01\x17e\x92\xe0\x00', 8)
Decimal('12000.00000000')
>>> decode_decimal(b'\xb2\xb4\xe7\x84', 4)  # Negative value
Decimal('-129676.7100')
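For what it's worth, on Python 3 the same conversion can be written without struct, since int.from_bytes accepts arbitrary-length two's-complement input (a sketch; decode_decimal_py3 is just an illustrative name):

```python
from decimal import Decimal

def decode_decimal_py3(value, num_places):
    # int.from_bytes handles any length and the sign directly,
    # so no padding to a fixed struct size is needed.
    unscaled = int.from_bytes(value, byteorder='big', signed=True)
    return Decimal(unscaled).scaleb(-num_places)

print(decode_decimal_py3(b'\x01\x17e\x92\xe0\x00', 8))  # 12000.00000000
```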

Refs:

https://avro.apache.org/docs/1.10.2/spec.html#Decimal
https://docs.python.org/3/library/struct.html#format-characters

0
votes

For some reason, the fastavro package works by default on the same file. I ended up using the code below. I'm still not sure whether there is a way to address this problem directly using the avro library, or to deserialise the output posted in the question above.

import fastavro
with open("filename.avro", 'rb') as fo: 
    for record in fastavro.reader(fo): 
        print(record)