
I am running a streaming pipeline using Beam / Dataflow. I am reading my input from Pub/Sub and converting it into a dict as below:

    import json
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub

    raw_loads_dict = (p
      | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
      | 'JSONParse' >> beam.Map(lambda x: json.loads(x))
    )

Since this is done for each element of a high-throughput pipeline, I am worried that this is not the most efficient way to do it. Is it?

What is the best practice in this case, considering that I sometimes manipulate the data afterwards, but could potentially stream it directly into BigQuery?


1 Answer


This approach is fine unless something extremely inefficient is happening or you have a specific concern (e.g. a metric you observe doesn't look right). JSON parsing is lightweight enough that it shouldn't be a problem here. Beam runners can also fuse adjacent element-wise operations like this so that they execute on the same worker, avoiding transferring data between machines.
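To illustrate, here is a minimal sketch of the pattern when you stream the parsed records straight into BigQuery. The table spec, dispositions and pipeline options are hypothetical and would need to match your setup; `PUBSUB_TOPIC_NAME` is the constant from your snippet.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical options; add your project/region/runner flags as needed.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadPubsubLoads' >> beam.io.ReadFromPubSub(
               topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
         # json.loads accepts bytes directly on Python 3.6+, no decode step needed.
         | 'JSONParse' >> beam.Map(json.loads)
         # This element-wise step and the sink below are adjacent, so the runner
         # can fuse them into a single stage and avoid an extra cross-worker hop.
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',  # hypothetical table spec
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))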

The situations where you typically start seeing performance issues involve either external systems (e.g. network latency or throttling when calling external services) or grouping operations (e.g. joins implemented with GroupByKey/CoGroupByKey), where the data has to be aggregated in a persistent store and transferred between worker machines (a shuffle). Even then, the cost of JSON parsing or running relatively simple per-element transformation code is likely negligible compared to the network, persistence and other related costs.
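For contrast, here is a hedged sketch of the kind of grouping step where shuffle costs do show up. The `id` key and the second PCollection (`some_other_pcollection`) are hypothetical stand-ins, and in a streaming pipeline you would also need a windowing/triggering strategy before grouping, omitted here for brevity.

    import apache_beam as beam

    def key_by_id(record):
        # Assumes the parsed JSON dict has an 'id' field (hypothetical).
        return (record['id'], record)

    # Cheap per-element steps, comparable in cost to the JSON parse.
    keyed_loads = raw_loads_dict | 'KeyLoads' >> beam.Map(key_by_id)
    keyed_meta = some_other_pcollection | 'KeyMeta' >> beam.Map(key_by_id)

    joined = (
        {'loads': keyed_loads, 'meta': keyed_meta}
        # The shuffle happens here, not in the per-element Maps above.
        | 'JoinById' >> beam.CoGroupByKey()
    )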