I am trying to find out whether there is a GCP Dataflow template for ingesting data from Pub/Sub into Cloud Spanner. I found that a default Dataflow template already exists for "Cloud Pub/Sub to BigQuery". I would like to know whether I can ingest data into Spanner in streaming or batch mode, and how that would behave.
2 Answers
There is a Dataflow template that imports Avro files in batch mode, which you can use by following these instructions. Unfortunately, a Cloud Pub/Sub streaming template is not available yet. If you would like one, you can file a feature request.
Actually, I tried the following: I used the "projects/pubsub-public-data/topics/taxirides-realtime" topic with the "gs://dataflow-templates/latest/Cloud_PubSub_to_Avro" template to load sample data files into my Cloud Storage bucket. I then stopped that streaming job and created a batch job with the "gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner" template. The batch job failed with the following error:
java.io.FileNotFoundException: No files matched spec: gs://cardataavi/archive/spanner-export.json
at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:166)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:153)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:636)
It seems that Spanner currently supports only Avro files in a Spanner-specific format. Is my understanding correct?
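For context, the stack trace shows the import job looking for a manifest file named spanner-export.json in the input directory, which the Pub/Sub to Avro template does not write. Below is a minimal Python sketch of generating such a manifest by hand. The manifest layout (a "tables" list whose entries have "name" and "dataFiles") is an assumption based on the Spanner documentation for importing non-Spanner Avro files, and the table name and Avro file name are hypothetical placeholders. Even with a manifest in place, the Avro field names and types would presumably still have to match the columns of an existing Spanner table, so this alone may not make the import succeed.

```python
import json


def build_manifest(tables):
    """Build a manifest dict from a mapping of table name -> Avro file paths.

    Assumed layout: {"tables": [{"name": ..., "dataFiles": [...]}]}.
    File paths are taken to be relative to the manifest's own directory.
    """
    return {
        "tables": [
            {"name": name, "dataFiles": list(files)}
            for name, files in tables.items()
        ]
    }


# Hypothetical table and file names, for illustration only.
manifest = build_manifest({"taxirides": ["taxirides-00000-of-00001.avro"]})
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The printed JSON would then be saved as spanner-export.json and uploaded next to the Avro files, at the path named in the error (gs://cardataavi/archive/spanner-export.json).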