
I'm trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. The PubSub messages arrive in JSON format, and the only operation I want to perform is a transformation from JSON to Parquet files.

In the official documentation I found a template provided by Google that reads data from a Pub/Sub topic and writes Avro files into a specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage-avro). The problem is that the template source code is written in Java, while I prefer to use the Python SDK.

These are the first tests I'm doing with Dataflow and Beam in general, and there's not a lot of material online to take a hint from. Any suggestions, links, guidance, or pieces of code would be greatly appreciated.

Does this documentation help? beam.apache.org/releases/pydoc/2.22.0/… – Peter Kim
Since you are looking for general advice, I can tell you that there is a PTransform called WriteToParquet() in Apache Beam, and it is used to write to Parquet, link. In addition, Google provides this code which shows how to write to GCS while reading from PubSub, here. So you could read your messages, then write them to a Parquet file (providing the schema) and finally store it in GCS. Did it help you? – Alexandre Moraes

Yes, thank you so much @AlexandreMoraes! I had seen the PTransform called WriteToParquet(), but I was looking for an example of how to use it. The link to the code you sent me is a good place to start! – Federico Barusco

@fedex Here, this should be a good start for you with the WriteToParquet() method. As you move forward with your code, you can post other questions and get more specific help. Meanwhile, can I summarise the information I shared as an answer in order to further contribute to the community? – Alexandre Moraes

@AlexandreMoraes yes, of course, it can certainly help anyone else working on a similar use case. – Federico Barusco

1 Answer


In order to further contribute to the community, I am summarising our discussion as an answer.

Since you are starting with Dataflow, I can point out some useful topics and advice:

  1. The built-in PTransform WriteToParquet() in Apache Beam is very useful. It writes a PCollection of records to Parquet files. In order to use it, you need to specify the schema, as indicated in the documentation. In addition, this article will help you understand how to use this method and how to write the output to a Google Cloud Storage (GCS) bucket.

  2. Google provides this code explaining how to read messages from PubSub and write them into Google Cloud Storage. This quickstart reads messages from PubSub and writes the messages from each window to a bucket.

  3. Since you want to read from PubSub, convert the messages to Parquet and store the files in a GCS bucket, I would advise you to make exactly those the steps of your pipeline: read your messages, write them to Parquet files (providing the schema) and store them in GCS. A minimal sketch of such a pipeline is shown below the list.
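To make those steps concrete, here is a minimal sketch of such a streaming pipeline with the Python SDK. The subscription, output path and Parquet schema below are placeholders you would replace with your own, and the sketch assumes each PubSub message is a flat JSON object matching that schema; fixed windows are applied so the file sink has bounded chunks of data to write out.

```python
import json

import apache_beam as beam
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder values: replace with your own subscription, bucket and schema.
INPUT_SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"
OUTPUT_PREFIX = "gs://my-bucket/output/messages"

# WriteToParquet needs the schema up front; this one assumes messages like
# {"name": "...", "value": 123}.
PARQUET_SCHEMA = pyarrow.schema([
    ("name", pyarrow.string()),
    ("value", pyarrow.int64()),
])


def run():
    # streaming=True because PubSub is an unbounded source.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=INPUT_SUBSCRIPTION)
            | "Parse JSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "Window" >> beam.WindowInto(window.FixedWindows(60))  # one batch of files per minute
            | "Write Parquet to GCS" >> beam.io.WriteToParquet(
                file_path_prefix=OUTPUT_PREFIX,
                schema=PARQUET_SCHEMA,
                file_name_suffix=".parquet",
                num_shards=1,  # an explicit shard count is typically needed for file sinks on unbounded input
            )
        )


if __name__ == "__main__":
    run()
```

To run it on Dataflow you would pass the usual pipeline options such as --runner=DataflowRunner, --project, --region and --temp_location.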

I encourage you to read the links above. Then, if you have any other questions, you can post another question in order to get more specific help.