Scenario: a client serializes a POJO using Avro's ReflectDatumWriter and writes the GenericRecord to a file. The schema obtained through reflection looks something like this (note the field ordering A, B, D, C):
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
....
....
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "D", "type": "string" },
{ "name": "C", "type": "string" },
....
....
]
}
An agent reads the file and uses a default schema (note the ordering A, B, C, D) to deserialize a subset of the record (the client's records are guaranteed to have these fields):
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "C", "type": "string" },
{ "name": "D", "type": "string" }
]
}
The problem: deserialization with the above subset schema results in the following exception:
Caused by: java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:259)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
However, deserialization succeeds if the subset schema also specifies the fields in the order A, B, D, C (the same as the client schema).
Is this behavior expected? I thought Avro depends only on field names to build the record, not on the field ordering.
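To make the ordering question concrete, here is a toy sketch in plain Java. It is not Avro's API or its exact wire format; `PositionalDecodingSketch`, `encode`, and `decode` are hypothetical names, and the one-byte length prefix is a simplification. It only mirrors one property of Avro's binary encoding: field values are written back to back in the writer's field order, with no field names in the data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PositionalDecodingSketch {
    // Toy encoding: each value is a 1-byte length followed by UTF-8 bytes.
    // No field names are written, only values in the given field order.
    static byte[] encode(List<String> fieldOrder, Map<String, String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : fieldOrder) {
            byte[] bytes = values.get(field).getBytes(StandardCharsets.UTF_8);
            out.write(bytes.length);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    // Walks the bytes in the order given by assumedWriterOrder, labels each value
    // with that field name, then keeps only the fields the reader asked for.
    static Map<String, String> decode(byte[] data, List<String> assumedWriterOrder,
                                      List<String> readerFields) {
        Map<String, String> byName = new LinkedHashMap<>();
        int pos = 0;
        for (String field : assumedWriterOrder) {
            int len = data[pos++] & 0xFF;
            byName.put(field, new String(data, pos, len, StandardCharsets.UTF_8));
            pos += len;
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            result.put(field, byName.get(field));
        }
        return result;
    }
}
```

Calling `decode(data, readerOrder, readerOrder)` on data written in the order A, B, D, C assigns D's value to C and C's value to D, because the decoder can only go by position. Calling `decode(data, writerOrder, readerOrder)` returns the right values, because the real write order is known and fields can then be matched by name.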
Is there a fix for this? Different clients may produce different field orders, and I have no way to enforce an ordering because the schema is generated through reflection.
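One possible direction, sketched below under assumptions: Avro's schema resolution matches record fields by name, but only when the reader is given both the writer's schema and the reader's schema. If the agent builds its `GenericDatumReader` from the subset schema alone, that schema is also treated as the writer's schema, which may be what produces the exception above. The sketch uses a raw binary round trip for brevity; `ResolveByNameSketch`, `eventSchema`, `roundTrip`, the extra field `X`, and the `value-` strings are all hypothetical and stand in for the elided parts of the real schema.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

class ResolveByNameSketch {
    // Builds an "Event" record schema whose string fields appear in the given order.
    static Schema eventSchema(String... fieldNames) {
        SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record("Event")
                .namespace("storage.management.example.schema")
                .fields();
        for (String name : fieldNames) {
            fields = fields.requiredString(name);
        }
        return fields.endRecord();
    }

    // Writes one record with writerSchema, then reads it back into readerSchema.
    static GenericRecord roundTrip(Schema writerSchema, Schema readerSchema) {
        try {
            GenericRecord written = new GenericData.Record(writerSchema);
            for (Schema.Field field : writerSchema.getFields()) {
                written.put(field.name(), "value-" + field.name());
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(written, encoder);
            encoder.flush();

            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            // Passing both schemas lets Avro skip unknown writer fields
            // and match the remaining ones by name rather than by position.
            GenericDatumReader<GenericRecord> reader =
                    new GenericDatumReader<>(writerSchema, readerSchema);
            return reader.read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `roundTrip(eventSchema("X", "A", "B", "D", "C"), eventSchema("A", "B", "C", "D"))` should return a record in which each field holds the value written under the same name. If the file is an Avro container file read through `DataFileReader`, the writer's schema is stored in the file header, so it should not need to be supplied separately.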