
I'm designing a pipeline with the following functionality:

  1. Read events from several Pub/Sub topics, each providing objects from which I can extract a StrId (String)
  2. Load a mapping table from Bigtable with KV<StrId, IntId>, where IntId is a unique integer
  3. Look up the StrId in that mapping:
    • If StrId is found, return the corresponding IntId
    • If StrId is not found, generate a new IntId sequentially, add it to the mapping, and also write it to Bigtable
  4. Pass the object and IntId downstream

I'm wondering whether the state approach would fit my needs here, and whether Bigtable is the right storage technology to use. The StrId-to-IntId mapping would have to be shared and persisted across all workers in order to keep the IntIds unique.
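
For concreteness, the lookup step I have in mind looks roughly like the stateful DoFn below. This is only a sketch: Event stands in for the decoded Pub/Sub payload, and lookupIntId / assignNewIntId are placeholders for whatever the shared mapping store ends up being.

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Stand-in for the decoded Pub/Sub payload type.
class Event {}

// Elements are keyed by StrId so that Beam scopes the state to each StrId.
class AssignIntIdFn extends DoFn<KV<String, Event>, KV<Long, Event>> {

  // Per-key cache of the IntId already resolved for this StrId.
  @StateId("intId")
  private final StateSpec<ValueState<Long>> intIdSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, Event> element,
      @StateId("intId") ValueState<Long> intIdState,
      OutputReceiver<KV<Long, Event>> out) {

    Long intId = intIdState.read();
    if (intId == null) {
      String strId = element.getKey();
      intId = lookupIntId(strId);      // placeholder: read the mapping row from the store
      if (intId == null) {
        intId = assignNewIntId(strId); // placeholder: atomically assign the next IntId
      }
      intIdState.write(intId);
    }
    out.output(KV.of(intId, element.getValue()));
  }

  // Placeholder for the real lookup; returns null if strId is unknown.
  private Long lookupIntId(String strId) {
    return null;
  }

  // Placeholder for the real "increment counter + insert mapping" call.
  private Long assignNewIntId(String strId) {
    throw new UnsupportedOperationException("wire this up to the mapping store");
  }
}
```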

Also, any links to code examples would be greatly appreciated. I'm aware of this Stack Overflow question and this blog post.

(For the downstream calculations, I need integer Ids, so there's no way around that)

1 Answer

This sounds very much like what OpenTSDB does to manage strings in its tsdb-uid table. That process requires a combination of increment (aka ReadModifyWrite) to get a unique id (which is an int64 / long), and a CheckAndMutate to ensure that you only have one unique mapping. It's a more difficult process than what you get out of SQL systems.
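
To make that concrete, here is a rough sketch of that get-or-create step using the Cloud Bigtable HBase client (increment maps to incrementColumnValue, CheckAndMutate to checkAndPut). The table name, column family, and counter row key are invented for illustration, and retries and error handling are left out.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of an OpenTSDB-style id assignment against Bigtable via the HBase API.
class UidAssigner {
  private static final byte[] FAMILY      = Bytes.toBytes("id");
  private static final byte[] QUALIFIER   = Bytes.toBytes("intId");
  private static final byte[] COUNTER_ROW = Bytes.toBytes("__counter__"); // made-up row key

  private final Table table;

  UidAssigner(Connection connection) throws Exception {
    this.table = connection.getTable(TableName.valueOf("uid-mapping")); // made-up table name
  }

  long getOrCreateIntId(String strId) throws Exception {
    byte[] row = Bytes.toBytes(strId);

    // 1. Fast path: the mapping already exists.
    Result existing = table.get(new Get(row).addColumn(FAMILY, QUALIFIER));
    if (!existing.isEmpty()) {
      return Bytes.toLong(existing.getValue(FAMILY, QUALIFIER));
    }

    // 2. Reserve a candidate id with an atomic increment (ReadModifyWrite).
    long candidate = table.incrementColumnValue(COUNTER_ROW, FAMILY, QUALIFIER, 1L);

    // 3. CheckAndMutate: write the mapping only if no other worker wrote it first
    //    (a null expected value means "the cell must not exist yet").
    Put put = new Put(row).addColumn(FAMILY, QUALIFIER, Bytes.toBytes(candidate));
    boolean won = table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
    if (won) {
      return candidate;
    }

    // 4. Lost the race: another worker assigned an id for this StrId; use theirs.
    //    The reserved candidate is simply skipped, leaving a gap in the sequence.
    Result theirs = table.get(new Get(row).addColumn(FAMILY, QUALIFIER));
    return Bytes.toLong(theirs.getValue(FAMILY, QUALIFIER));
  }
}
```

Note that a worker which loses the race discards its reserved counter value, so the ids stay unique but are not gap-free.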

That said, Cloud Bigtable is not ideal for managing small tables like the uid table (i.e., less than a couple of GB). If you're already using Cloud Bigtable to store lots of data, you can consider using it for the uid table as well. However, I would still suggest also looking at a SQL alternative for that functionality.
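
For comparison, with something like PostgreSQL the whole get-or-create collapses into a single statement. A sketch with plain JDBC and an invented uid_mapping table (names and schema are illustrative, not from the original post):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Assumes PostgreSQL and a table such as:
//   CREATE TABLE uid_mapping (str_id TEXT PRIMARY KEY, int_id BIGSERIAL NOT NULL);
class SqlUidAssigner {
  private final Connection connection;

  SqlUidAssigner(Connection connection) {
    this.connection = connection;
  }

  long getOrCreateIntId(String strId) throws SQLException {
    // Insert the StrId if it is new; the no-op DO UPDATE makes RETURNING yield the
    // existing int_id when the row is already there. Sequence values consumed by
    // conflicting inserts leave gaps, but the ids stay unique.
    String sql =
        "INSERT INTO uid_mapping (str_id) VALUES (?) "
            + "ON CONFLICT (str_id) DO UPDATE SET str_id = EXCLUDED.str_id "
            + "RETURNING int_id";
    try (PreparedStatement stmt = connection.prepareStatement(sql)) {
      stmt.setString(1, strId);
      try (ResultSet rs = stmt.executeQuery()) {
        rs.next();
        return rs.getLong(1);
      }
    }
  }
}
```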