
I'm trying to export google cloud datastore data to Avro files in google cloud storage and then load those files into BigQuery.

Firstly, I know that BigQuery can load Datastore backups directly, but that route has several disadvantages that I'd like to avoid.

With the motivation for this experiment clarified, here is my Dataflow pipeline that exports the data to Avro format:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GCDSEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
    private static final String ENTITY_KIND = "Dummy";

    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                .withQuery(q.build())))
            .apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
            .setCoder(AvroCoder.of(getSchema()))
            .apply(AvroIO.Write.named("WriteToAvro")
                    .to(GCS_TARGET_URI)
                    .withSchema(getSchema())
                    .withSuffix(".avro"));
        p.run();

    }

    private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();
            // Write the protobuf Entity into an in-memory Avro container file
            // using the schema derived from the protobuf class...
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            // ... then read it back as a GenericRecord so the downstream
            // AvroIO.Write stage can serialize it with the same schema.
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);

            c.output(dataFileReader.next());
            dataFileReader.close();
        }
    }
}

The pipeline runs fine. However, when I try to load the resultant Avro file into BigQuery, I get the following error:

bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE   
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.

However, if I load the resultant Avro file with avro-tools, everything is fine:

avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
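
For what it's worth, a minimal check with the Java Avro library (a sketch; the local file name is assumed, after copying the file down from Cloud Storage) also reads the file without complaint and prints the schema namespace that the protobuf conversion produced:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class InspectAvroFile {
    public static void main(String[] args) throws Exception {
        // Open the file that the pipeline wrote, copied to the local disk.
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("datastore-dummy-00000-of-00001.avro"),
                new GenericDatumReader<GenericRecord>())) {
            System.out.println("namespace: " + reader.getSchema().getNamespace());
            // null here means no block compression codec is set.
            System.out.println("codec: " + reader.getMetaString("avro.codec"));
            System.out.println("first record: " + reader.next());
        }
    }
}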

I used this code to populate the Datastore with dummy data before running the Dataflow pipeline:

package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

public class PopulateDummyEntities {

    public static void main(String[] args) throws IOException {

        Datastore datastore = DatastoreOptions.builder()
                .projectId("myProjectId")
                .authCredentials(AuthCredentials.createApplicationDefaults())
                .build().service();

        KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");

        // Insert 4000 "Dummy" entities in batches of 100.
        Batch batch = datastore.newBatch();
        int batchCount = 0;
        for (int i = 0; i < 4000; i++) {
            IncompleteKey key = dummyKeyFactory.newKey();
            System.out.println("adding entity " + i);
            batch.add(Entity.builder(key).set("number", i).build());
            batchCount++;
            if (batchCount > 99) {
                batch.submit();
                batch = datastore.newBatch();
                batchCount = 0;
            }
        }
        // Submit any trailing partial batch (a no-op here, since 4000 % 100 == 0).
        if (batchCount > 0) {
            batch.submit();
        }

        System.out.println("done");
    }
}

So why is BigQuery rejecting my Avro files?

What codec is used in the Avro file for block compression? BigQuery supports only DEFLATE at the moment. See cloud.google.com/bigquery/data-formats#avro_format. – Michael Sheldon

@MichaelSheldon As far as I can tell there is no block compression used in the file. The avro-tools getmeta command does not print a value for avro.codec. – Frank Wilson

Here's a link to the output Avro file of dummy entities: drive.google.com/open?id=0B4dY1dqkTY14VmZnc0pCUjNrMEE – Frank Wilson

I know you probably have good reasons that you are choosing to do things this way, but have you considered having your Dataflow pipeline write directly to BigQuery? – Kenn Knowles
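
For reference, here is a rough sketch of the direct-to-BigQuery route Kenn suggests, using BigQueryIO from the same Dataflow SDK. The table spec and the single-column mapping of the "number" property are my assumptions, not tested code:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;

import java.util.Collections;

// Inside main(), replacing the ProtoBufToAvro/AvroIO stages:
TableSchema tableSchema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("number").setType("INTEGER")));

p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
        .withDataset(options.as(DataflowPipelineOptions.class).getProject())
        .withQuery(q.build())))
    .apply(ParDo.named("EntityToTableRow").of(new DoFn<Entity, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow();
            // Map the single dummy property; a real pipeline would map them all.
            for (DatastoreV1.Property prop : c.element().getPropertyList()) {
                if ("number".equals(prop.getName())) {
                    row.set("number", prop.getValue().getIntegerValue());
                }
            }
            c.output(row);
        }
    }))
    .apply(BigQueryIO.Write.named("WriteToBigQuery")
            .to("myProjectId:dummy.dummy_1")
            .withSchema(tableSchema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

This sidesteps the Avro file format entirely, at the cost of losing the intermediate files in Cloud Storage.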

1 Answer


BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:

Invalid namespace: com.google.api.services.datastore.DatastoreV1$

We're working on surfacing these Avro error messages to the end user.
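
If you need a workaround in the meantime, one option (a sketch, under the assumption that only the namespace is at fault) is to rewrite the schema JSON that ProtobufData generates so the namespace no longer contains the "$", and then use that sanitized schema consistently throughout the pipeline (AvroCoder, AvroIO.Write, and the round-trip inside the DoFn):

import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

import com.google.api.services.datastore.DatastoreV1.Entity;

// Hypothetical helper: strip the "$" that ProtobufData emits for nested
// classes by rewriting the schema JSON, then parse it back. Avro's binary
// encoding does not embed record names, so renaming does not change the
// serialized bytes, but writer and reader must agree on this one schema.
private static Schema getSanitizedSchema() {
    Schema raw = ProtobufData.get().getSchema(Entity.class);
    String json = raw.toString().replace(
            "com.google.api.services.datastore.DatastoreV1$",
            "com.google.api.services.datastore.DatastoreV1");
    return new Schema.Parser().parse(json);
}

With getSchema() in the pipeline above delegating to this helper, the files that land in Cloud Storage should carry a namespace the C++ parser accepts.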