6
votes

I'm setting up a simple Proof of Concept to learn some of the concepts in Google Cloud, specifically PubSub and Dataflow.

I have a PubSub topic greeting

I've created a simple cloud function that sends publishes a message to that topic:

const escapeHtml = require('escape-html');
const { Buffer } = require('safe-buffer');
const { PubSub } = require('@google-cloud/pubsub');

exports.publishGreetingHTTP = async (req, res) => {
    let name = 'no name provided';
    if (req.query && req.query.name) {
        name = escapeHtml(req.query.name);
    } else if (req.body && req.body.name) {
        name = escapeHtml(req.body.name);
    }
    const pubsub = new PubSub();
    const topicName = 'greeting';
    const data = JSON.stringify({ hello: name });
    const dataBuffer = Buffer.from(data);
    const messageId = await pubsub.topic(topicName).publish(dataBuffer);
    res.send(`Message ${messageId} published. name=${name}`);
};

I set up a different cloud function that it triggered by the topic:

const { Buffer } = require('safe-buffer');

exports.subscribeGreetingPubSub = (data) => {
    const pubSubMessage = data;
    const passedData = pubSubMessage.data ? JSON.parse(Buffer.from(pubSubMessage.data, 'base64').toString()) : { error: 'no data' };

    console.log(passedData);
};

This works great and I see it registered as a subscription on the topic.

Now I want to send the use Dataflow to send the data to BigQuery

There appear to be 2 template to accomplish this:

I don't understand the difference between Topic and Subscription in this context.

https://medium.com/google-cloud/new-updates-to-pub-sub-to-bigquery-templates-7844444e6068 sheds a little bit of lights:

Note that a caveat of using subscriptions over topics is that subscriptions are only read once while topics can be read multiple times. Therefore a subscription template cannot support multiple concurrent pipelines reading the same subscription.

But I must say I'm still lost to understand the real implications of this.

1

1 Answers

12
votes

If you use the Topic to BigQuery template, Dataflow will create a subscription behind the scenes for you that reads from the specified topic. If you use the Subscription to BigQuery template, you will need to provide your own subscription.

You can use Subscription to BigQuery templates to emulate the behavior of a Topic to BigQuery template by creating multiple subscription-connected BigQuery pipelines reading from the same topic.

For new deployments, using the Subscription to BigQuery template is preferred. If you stop and restart a pipeline using the Topic to BigQuery template, a new subscription will be created, which may cause you to miss some messages that were published while the pipeline was down. The Subscription to BigQuery template doesn't have this disadvantage, since it uses the same subscription even after the pipeline is restarted.