
I'm trying to build a streaming/batch pipeline which reads events from Pub/Sub and writes them into BigQuery using Python 3.6.

According to the documentation, Cloud Pub/Sub assigns a unique message_id and timestamp to each message, which can be used to detect duplicate messages received by the subscriber. (https://cloud.google.com/pubsub/docs/faq)

The requirements are as follows: 1) Messages can arrive in any order (asynchronous messaging). 2) A unique id and a timestamp (when the record was pushed to the Pub/Sub topic / pulled from the topic) should be appended to the existing record.

Input data:

    Name  Age
    xyz   21

Output record to BigQuery:

    Name  Age  Unique_Id   Record_Timestamp (time when it was written to the topic or pulled from the topic)
    xyz   21   hdshdfd_12  2019-10-16 12:06:54

Can anyone point me to a link on how to handle this, or provide input on whether this can be done with Pub/Sub?


1 Answer


You have all the data you need in the Pub/Sub message. Be careful: the Pub/Sub message format is slightly different if you use the Cloud Functions event format; in that case, the timestamp and the messageId are in the context object.

Message format

   {
     "message": {
       "attributes": {
         "key": "value"
       },
       "data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
       "messageId": "136969346945",
       "publishTime": "2014-10-02T15:01:23.045123456Z"
     },
     "subscription": "projects/myproject/subscriptions/mysubscription"
   }

The data is base64-encoded. I recommend building a JSON document with your data and publishing that JSON to Pub/Sub. You receive a base64 string, but after decoding it you get your JSON back and can use your data.
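For example, a minimal sketch of that round trip, using the record values from the question (nothing here is specific to a particular client library):

    import base64
    import json

    # Hypothetical record, serialized as JSON before being published.
    record = {"Name": "xyz", "Age": 21}
    payload = json.dumps(record).encode("utf-8")   # bytes passed to the publisher

    # In the raw message format above, "data" carries the base64 form of those bytes.
    data_field = base64.b64encode(payload).decode("ascii")

    # On the consumer side, decode and parse to get the original dict back.
    restored = json.loads(base64.b64decode(data_field))
    assert restored == record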

You also have publishTime, the time of publication to Pub/Sub. You don't get the pull time; you can capture it yourself by taking time = now() in your message handler.
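A minimal pull-subscriber sketch with the google-cloud-pubsub client; the project and subscription names are placeholders, and note that this client hands you message.data already decoded from base64:

    import json
    from datetime import datetime, timezone

    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    def callback(message):
        # Unique id and publish time come with the message itself.
        record = json.loads(message.data.decode("utf-8"))
        record["Unique_Id"] = message.message_id
        record["Record_Timestamp"] = str(message.publish_time)

        # Pub/Sub does not give you a "pull time"; capture it here if you need it.
        pull_time = datetime.now(timezone.utc)

        print(record, pull_time)
        message.ack()

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")
    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=60)  # listen for a minute in this sketch
    except Exception:
        streaming_pull.cancel()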

Now you have all your data. You can build a BigQuery row with <your data>, messageId, and publishTime. You can either write a file (CSV or newline-delimited JSON, for example) and perform a load job into BigQuery, or use streaming inserts. It all depends on your requirements/preferences.
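
For the streaming-insert option, a sketch with the google-cloud-bigquery client, assuming the table and its columns (the ones from the question) already exist; the table id is a placeholder:

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"

    # One row built from <your data> plus messageId and publishTime.
    rows = [{
        "Name": "xyz",
        "Age": 21,
        "Unique_Id": "hdshdfd_12",
        "Record_Timestamp": "2019-10-16 12:06:54",
    }]

    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("BigQuery insert errors:", errors)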