In my Apache Beam pipeline I have a PCollection of Row objects (org.apache.beam.sdk.values.Row) that I want to write to Avro files. Here is a simplified version of my code:
Pipeline p = Pipeline.create();

Schema inputSchema = Schema.of(
    Schema.Field.of("MyField1", Schema.FieldType.INT32)
);

Row row = Row.withSchema(inputSchema).addValues(1).build();
PCollection<Row> myRow = p.apply(Create.of(row)).setCoder(RowCoder.of(inputSchema));

myRow.apply(
    "WriteToAvro",
    AvroIO.write(Row.class)
        .to("src/tmp/my_files")
        .withWindowedWrites()
        .withNumShards(10));

p.run();
The file gets created, but it looks like this (in JSON form):
"schema" : {
"fieldIndices" : {
"MyField1" : 0
},
"encodingPositions" : {
"MyField1" : 0
},
"fields" : [
{
}
],
"hashCode" : 545134948,
"uuid" : {
},
"options" : {
"options" : {
}
}
}
So only the schema is there, along with a bunch of useless metadata. What's the right way to write Avro files from Row objects so that I get the actual data and not just the schema? And can I get rid of the metadata?
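In case it's useful context, here is the direction I've been considering but haven't verified: derive an Avro schema from the Beam schema with AvroUtils.toAvroSchema, convert each Row to a GenericRecord, and write with AvroIO.writeGenericRecords. This is only a sketch. Note that AvroUtils and AvroIO live under org.apache.beam.sdk.extensions.avro in newer Beam releases (the imports below use the older core packages), and that the Avro schema is carried around as a JSON string because org.apache.avro.Schema is not Serializable:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

// Derive the Avro schema from the Beam schema at graph construction time.
org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(inputSchema);
// Capture it as JSON, since org.apache.avro.Schema is not Serializable
// and can't be referenced directly from the lambda below.
String avroSchemaJson = avroSchema.toString();

myRow
    .apply(
        "RowToGenericRecord",
        MapElements.into(TypeDescriptor.of(GenericRecord.class))
            .via(row -> AvroUtils.toGenericRecord(
                row, new org.apache.avro.Schema.Parser().parse(avroSchemaJson))))
    .setCoder(AvroCoder.of(avroSchema))
    .apply(
        "WriteToAvro",
        AvroIO.writeGenericRecords(avroSchemaJson)
            .to("src/tmp/my_files")
            .withNumShards(10));

(Re-parsing the schema per element is obviously inefficient; a DoFn that parses it once in @Setup would be better.) Would that be the recommended approach, or is there a more direct API for writing Rows to Avro?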