7
votes

I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic.

What would be the easiest/fastest way to do it?

Thank you in advance,


2 Answers

13
votes

2019 update:

Now it's really easy with a click-to-BigQuery option in Pub/Sub:

[Screenshot: the click-to-BigQuery option in the Pub/Sub console]

Find it on: https://console.cloud.google.com/cloudpubsub/topicList


The easiest way I know of is going through Google Cloud Dataflow, which natively knows how to access BigQuery and Pub/Sub.

In theory it should be as easy as the following Python lines:

import apache_beam as beam

p = beam.Pipeline(options=pipeline_options)  # pipeline_options: your PipelineOptions (runner, project, etc.)
# Read every row of the BigQuery table
tablerows = p | 'read' >> beam.io.Read(
  beam.io.BigQuerySource('clouddataflow-readonly:samples.weather_stations'))
# Publish every row to the Pub/Sub topic
tablerows | 'write' >> beam.io.Write(
  beam.io.PubSubSink('projects/fh-dataflow/topics/bq2pubsub-topic'))
p.run()

This combination of Python, Dataflow, BigQuery, and Pub/Sub doesn't work today (the Python Dataflow SDK is still in beta, but keep an eye on the changelog).
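For comparison, with the current Apache Beam Python SDK the same pipeline would look roughly like the sketch below. Treat it as a sketch rather than a tested recipe: ReadFromBigQuery and WriteToPubSub are the current transform names, the table and topic are the same examples as above, and depending on the runner you may also need a GCS temp location for the BigQuery export and the streaming option for the Pub/Sub write.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # Read every row of the table as a Python dict
     | 'read' >> beam.io.ReadFromBigQuery(
         table='clouddataflow-readonly:samples.weather_stations')
     # Serialize each row to JSON bytes; WriteToPubSub expects bytes
     | 'to bytes' >> beam.Map(lambda row: json.dumps(row, default=str).encode('utf-8'))
     # Publish each row as one Pub/Sub message
     | 'write' >> beam.io.WriteToPubSub(
         topic='projects/fh-dataflow/topics/bq2pubsub-topic'))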

We can do the same with Java, and it works well; I just tested it. It runs both locally and on the hosted Dataflow runner:

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read the whole BigQuery table
PCollection<TableRow> weatherData = p.apply(
        BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

// Convert each TableRow to a string and publish it to the Pub/Sub topic
weatherData.apply(ParDo.named("tableRow2string").of(new DoFn<TableRow, String>() {
    @Override
    public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
        c.output(c.element().toString());
    }
})).apply(PubsubIO.Write.named("WriteToPubsub").topic("projects/myproject/topics/bq2pubsub-topic"));

p.run();

Check that the messages made it by pulling them from a subscription on the topic (sub1 here):

gcloud --project myproject beta pubsub subscriptions pull --auto-ack sub1

Hosted Dataflow screenshot:

[Screenshot: the pipeline running on the hosted Dataflow service]

3
votes

That really depends on the size of the table.

If it's a small table (a few thousand records, a couple dozen columns), then you could set up a process to query the entire table, convert the response into a JSON array, and push it to Pub/Sub.
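For illustration, here's a minimal sketch of that approach in Python with the google-cloud-bigquery and google-cloud-pubsub client libraries. The project, dataset, table, and topic names are placeholders, and it publishes one JSON message per row instead of one big JSON array (Pub/Sub messages are limited to 10 MB):

import json

from google.cloud import bigquery, pubsub_v1

PROJECT, DATASET, TABLE = 'my-project', 'my_dataset', 'my_table'  # placeholders
TOPIC = 'my-topic'  # placeholder

bq = bigquery.Client(project=PROJECT)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

# Query the whole table and publish one JSON message per row
rows = bq.query(f'SELECT * FROM `{PROJECT}.{DATASET}.{TABLE}`').result()
futures = [
    publisher.publish(topic_path, json.dumps(dict(row), default=str).encode('utf-8'))
    for row in rows
]
for f in futures:
    f.result()  # block until Pub/Sub has accepted every message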

If it's a big table (millions/billions of records, hundreds of columns), you'd have to export it to a file first, and then prepare and ship it to Pub/Sub.

It also depends on your partitioning policy: if your tables are partitioned by date, you might again be able to query instead of export.

Last but not least, it also depends on the frequency: is this a one-time deal (then export), or a continuous process (then use table decorators to query only the latest data)?

We'd need some more information to give a truly helpful answer.

Edit

Based on your comments about the size of the table, I think the best way would be to have a script that would:

  1. Export the table to GCS as newline-delimited JSON

  2. Process the file (reading it line by line) and send each line to Pub/Sub

There are client libraries for most programming languages. I've done similar things with Python, and it's fairly straightforward.
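For example, here's a rough Python sketch of those two steps with the google-cloud-bigquery, google-cloud-storage, and google-cloud-pubsub client libraries. The project, bucket, dataset, table, and topic names are placeholders, and the export is assumed to fit in a single file (larger exports need a wildcard URI and multiple files):

from google.cloud import bigquery, pubsub_v1, storage

PROJECT, BUCKET = 'my-project', 'my-bucket'  # placeholders
DATASET, TABLE, TOPIC = 'my_dataset', 'my_table', 'my-topic'  # placeholders

# 1. Export the table to GCS as newline-delimited JSON
bq = bigquery.Client(project=PROJECT)
extract_job = bq.extract_table(
    f'{PROJECT}.{DATASET}.{TABLE}',
    f'gs://{BUCKET}/{TABLE}.json',
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON),
)
extract_job.result()  # wait for the export job to finish

# 2. Read the exported file line by line and publish each row to Pub/Sub
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)
blob = storage.Client(project=PROJECT).bucket(BUCKET).blob(f'{TABLE}.json')
futures = [
    publisher.publish(topic_path, line)  # each line is one JSON row
    for line in blob.download_as_bytes().splitlines()
]
for f in futures:
    f.result()  # block until Pub/Sub has accepted every message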