We are looking to stream Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow and then write them to Cloud Storage. I am wondering what the best data format would be when writing the data to Cloud Storage. My further use case might also involve using Dataflow to read from Cloud Storage again for further operations that persist to a data lake, based on need. A few of the options I was considering:
a) Use Dataflow to directly write the JSON strings themselves to Cloud Storage? I assume every line in the file in Cloud Storage is then to be treated as a single message when reading from Cloud Storage and processing for further operations to the data lake, right? (See the sketch after these options for what I mean.)
b) Transform the JSON to a text file format using Dataflow and save it in Cloud Storage.
c) Any other options?
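
For reference, this is roughly how I'd expect the read-back in option a) to work. A minimal sketch, assuming newline-delimited JSON files in Cloud Storage; the bucket path is a placeholder:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # ReadFromText emits one element per line of the matched files,
     # so each line is treated as a single message.
     | "ReadFromGCS" >> beam.io.ReadFromText("gs://YOUR_BUCKET/pubsub-output/*")
     # Parse each line back into a dict for further processing.
     | "ParseJson" >> beam.Map(json.loads)
     | "Debug" >> beam.Map(print))
```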
1 Answer
You could store your data in JSON format for further use in BigQuery, if you need to analyze your data later; newline-delimited JSON is the format BigQuery expects when loading from Cloud Storage. The Dataflow approach you mention in option a) is a good way to handle your scenario. Additionally, you could use Cloud Functions with a Pub/Sub trigger and write the content to Cloud Storage from there. You could use the code shown in this tutorial as a base for that scenario, as it publishes the information to a topic, then gathers the message from the topic and creates a Cloud Storage object with the message as its content.
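
Here is a minimal sketch of option a) using the Apache Beam Python SDK. The project, topic, and bucket names are placeholders; adjust the window size to control how often files are committed:

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# streaming=True is required for an unbounded Pub/Sub source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/YOUR_PROJECT/topics/YOUR_TOPIC")
     # Pub/Sub payloads arrive as bytes; decode them back to JSON strings.
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     # Group the unbounded stream into fixed one-minute windows so that
     # finite files can be closed and committed.
     | "Window" >> beam.WindowInto(FixedWindows(60))
     # TextSink writes each element as one line, producing the
     # newline-delimited JSON files described above.
     | "WriteToGCS" >> fileio.WriteToFiles(
           path="gs://YOUR_BUCKET/pubsub-output/",
           sink=lambda dest: fileio.TextSink(),
           shards=1))
```

To run this on Dataflow rather than locally, pass the usual `--runner=DataflowRunner`, `--project`, `--region`, and `--temp_location` pipeline options.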
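
And a sketch of the Cloud Functions alternative: a Pub/Sub-triggered background function (1st gen) that writes each message as its own Cloud Storage object. The bucket name is a placeholder, and one-object-per-message is just one possible layout:

```python
import base64
from google.cloud import storage

BUCKET_NAME = "YOUR_BUCKET"  # placeholder: replace with your bucket

def pubsub_to_gcs(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    # Pub/Sub delivers the payload base64-encoded in event["data"].
    payload = base64.b64decode(event["data"]).decode("utf-8")
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    # One object per message, named after the Pub/Sub event ID.
    blob = bucket.blob(f"messages/{context.event_id}.json")
    blob.upload_from_string(payload, content_type="application/json")
```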