0 votes

I'm new to Pub/Sub and Dataflow/Beam. I have done a task with Spark and Kafka, and I want to do the same using Pub/Sub and Dataflow/Beam. From what I understand so far, Kafka is similar to Pub/Sub and Spark is similar to Dataflow/Beam.

The task is to take a JSON file and write it to a Pub/Sub topic. Then, using Beam/Dataflow, I need to get that data into a PCollection. How can I achieve this?
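For the first half (file -> Pub/Sub) I was thinking of something along these lines, using the google-cloud-pubsub client rather than Beam (the project, topic and file names are just placeholders); my main question is the second half, getting those messages into a PCollection:

from google.cloud import pubsub_v1

project_id = "my-project"   # placeholder
topic_name = "my-topic"     # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

# Publish each line of the JSON file as a separate Pub/Sub message.
with open("input.json") as f:   # placeholder file
    for line in f:
        line = line.strip()
        if line:
            publisher.publish(topic_path, data=line.encode("utf-8"))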

The Apache Beam Python SDK does not support reading from Pub/Sub. Reference: Built-in I/O Transforms. – Andrew Nguonly
What about this: Beam-PubSub? – Minato

2 Answers

7 votes

I solved the problem. I'm able to continuously read data from a Pub/Sub topic, do some processing on it, and then write the results to Datastore.

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore

# jsonParse (a DoFn), EntityWrapper, config and known_args are defined
# elsewhere in my code.
with beam.Pipeline(options=options) as p:

    # Read from Pub/Sub into a PCollection of message strings.
    lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)

    # Split, parse each JSON object into (key, value) pairs, window and sum per key.
    transformed = (lines
                   | 'Split' >> beam.FlatMap(lambda x: x.split("\n"))
                   | 'jsonParse' >> beam.ParDo(jsonParse())
                   | 'Window' >> beam.WindowInto(window.FixedWindows(15, 0))
                   | 'Combine' >> beam.CombinePerKey(sum))

    # Create a Datastore Entity from each aggregated result.
    transformed = transformed | 'create entity' >> beam.Map(
        EntityWrapper(config.NAMESPACE, config.KIND, config.ANCESTOR).make_entity)

    # Write to Datastore.
    transformed | 'write to datastore' >> WriteToDatastore(known_args.dataset_id)
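jsonParse and EntityWrapper are helpers I defined elsewhere; jsonParse just turns each JSON string into a (key, value) pair so that CombinePerKey(sum) can sum per key. Roughly like this (the field names here are only placeholders, not the real schema):

import json

import apache_beam as beam

class jsonParse(beam.DoFn):
    # Sketch only: emit one (key, value) pair per JSON object so the
    # pipeline can sum the numeric field per key. 'user' and 'amount'
    # are placeholder field names.
    def process(self, element):
        record = json.loads(element)
        yield (record['user'], record['amount'])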
1 vote

Pub/Sub is a streaming source/sink (it doesn't make sense to read from or write to it only once). Streaming support in the Dataflow Python SDK is not yet available.

Documentation: https://cloud.google.com/dataflow/release-notes/release-notes-python.

Once streaming is available, you should be able to do this pretty trivially.

However, if you are going from file -> Pub/Sub and then Pub/Sub -> PCollection, you should be able to do this with a batch pipeline and drop the Pub/Sub step entirely. You can look at Beam's basic file I/O for that; a rough sketch is below.
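For example, a minimal batch pipeline that reads the JSON file straight into a PCollection (the file path is a placeholder, and this assumes one JSON object per line):

import json

import apache_beam as beam

with beam.Pipeline() as p:
    # Each line of the file becomes one string element of the PCollection.
    records = (p
               | 'ReadFile' >> beam.io.ReadFromText('gs://my-bucket/input.json')
               | 'ParseJson' >> beam.Map(json.loads))
    # 'records' is now a PCollection of parsed JSON objects (dicts).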