0 votes

I need to read in an Avro file from local disk or GCS, via Java. I followed the example from the docs at https://beam.apache.org/documentation/sdks/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/AvroIO.html

Pipeline p = ...;

// A Read from a GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
            .from("gs://my_bucket/path/to/records-*.avro"));

But when I try to process it through a DoFn, there doesn't appear to be any data there. The Avro file does have data, and I was able to run a function to generate a schema from it. If anybody has advice, please share.
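For reference, a stripped-down version of the kind of DoFn I am running. This is just a pass-through that logs every element, so it should print something if the read emits anything at all (the step name "LogRecords" is only illustrative; assumes Beam 2.0.0 and the records collection from the snippet above):

// Requires imports: org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.transforms.ParDo
records.apply("LogRecords", ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Log each element so we can see whether the read emits anything at all.
        System.out.println("Read record: " + c.element());
        c.output(c.element());
    }
}));
p.run().waitUntilFinish();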

2
Are there any relevant log messages? Can you describe what the DoFn is doing? Can you post any more relevant code? Maybe post the full pipeline implementation. In the Dataflow UI, do you see the input element count remain at zero? – Andrew Nguonly

2 Answers

2 votes

I absolutely agree with Andrew; more information would be required. However, I think you should consider using AvroIO.Read, which is a more appropriate transform for reading records from one or more Avro files.

https://cloud.google.com/dataflow/model/avro-io#reading-with-avroio

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Schema schema = new Schema.Parser().parse(new File("schema.avsc"));

PCollection<GenericRecord> records =
    p.apply(AvroIO.Read.named("ReadFromAvro")
                       .from("gs://my_bucket/path/records-*.avro")
                       .withSchema(schema));
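If you want to sanity-check that the read actually emits elements, you could count them and log the total before adding any real processing. A rough sketch against the Dataflow 1.x SDK (the logging step is just illustrative):

// Requires imports: com.google.cloud.dataflow.sdk.transforms.Count,
// com.google.cloud.dataflow.sdk.transforms.DoFn, com.google.cloud.dataflow.sdk.transforms.ParDo
PCollection<Long> count = records.apply(Count.<GenericRecord>globally());
count.apply(ParDo.of(new DoFn<Long, Void>() {
    @Override
    public void processElement(ProcessContext c) {
        // Prints the total number of records read from the Avro files.
        System.out.println("Record count: " + c.element());
    }
}));
p.run();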
0 votes

Hey guys, thanks for looking into this. I can't share any code because it belongs to clients. I did not receive any error messages, and the debugger did see data, but we were not able to see the data from the Avro file (via ParDo).

I did manage to fix the issue by recreating the Dataflow project using the Eclipse wizard. I even used the same code. I wonder why I did not receive any error messages.