I'm trying to export google cloud datastore data to Avro files in google cloud storage and then load those files into BigQuery.
Firstly, I know that Big Query loads datastore backups. This has several disadvantages that I'd like to avoid:
- Backup tool is closed source
- Backup tool format is undocumented.
- Backup tool format cannot be read directly by Dataflow
- Backup scheduling for appengine is in (apparently perpetual) alpha.
- It is possible to implement your own backup handler in appengine, but it is fire and forget. You won't know when exactly the backup has finished or what the file name will be.
With the motivation clarified for this experiment here is my Dataflow Pipeline to export the data to avro format:
package com.example.dataflow;
import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.ByteArrayOutputStream;
public class GCDSEntitiesToAvroSSCCEPipeline {
private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
private static final String ENTITY_KIND = "Dummy";
private static Schema getSchema() {
return ProtobufData.get().getSchema(Entity.class);
}
private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
.addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));
p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
.withDataset(options.as(DataflowPipelineOptions.class).getProject())
.withQuery(q.build())))
.apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
.setCoder(AvroCoder.of(getSchema()))
.apply(AvroIO.Write.named("WriteToAvro")
.to(GCS_TARGET_URI)
.withSchema(getSchema())
.withSuffix(".avro"));
p.run();
}
private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c) throws Exception {
Schema schema = getSchema();
ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
dataFileWriter.create(schema, bos);
dataFileWriter.append(c.element());
dataFileWriter.close();
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
new SeekableByteArrayInput(bos.toByteArray()), datumReader);
c.output(dataFileReader.next());
}
}
}
The pipeline runs fine, however when I try to load the resultant Avro file into big query I get the following error:
bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.
However if I load the resultant avro file with avro tool, everything is just fine:
avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
I used this code to populate the datastore with dummy data before running the Dataflow pipeline:
package com.example.datastore;
import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;
import java.io.IOException;
public static void main(String[] args) throws IOException {
Datastore datastore = DatastoreOptions.builder()
.projectId("myProjectId")
.authCredentials(AuthCredentials.createApplicationDefaults())
.build().service();
KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");
Batch batch = datastore.newBatch();
int batchCount = 0;
for (int i = 0; i < 4000; i++){
IncompleteKey key = dummyKeyFactory.newKey();
System.out.println("adding entity " + i);
batch.add(Entity.builder(key).set("number", i).build());
batchCount++;
if (batchCount > 99) {
batch.submit();
batch = datastore.newBatch();
batchCount = 0;
}
}
System.out.println("done");
}
So why is BigQuery rejecting my avro files?
getmeta
command does not print a value foravro.codec
. – Frank Wilson