I am running a streaming pipeline using Beam / Dataflow. I am reading my input from Pub/Sub and converting it into a dict as below:
import json
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub

raw_loads_dict = (p
| 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
| 'JSONParse' >> beam.Map(json.loads)  # pass json.loads directly instead of wrapping it in a lambda
)
Since this runs on every element of a high-throughput pipeline, I am worried that this is not the most efficient way to do it.
What is the best practice in this case, considering that I sometimes manipulate the data afterwards, but could potentially stream it directly into BigQuery?
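For context, this is roughly the direct-to-BigQuery variant I have in mind, as a minimal sketch using streaming inserts; BQ_TABLE_SPEC and BQ_SCHEMA are placeholders for my actual table spec and schema:

from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

# Stream the parsed dicts straight into BigQuery (streaming inserts in a streaming pipeline).
raw_loads_dict | 'WriteToBigQuery' >> WriteToBigQuery(
    table=BQ_TABLE_SPEC,  # placeholder, e.g. 'my-project:my_dataset.my_table'
    schema=BQ_SCHEMA,     # placeholder schema matching the parsed dicts
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=BigQueryDisposition.WRITE_APPEND)

Would skipping the intermediate manipulation and writing like this be preferable at high throughput, or is the per-element JSON parsing a negligible cost either way?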