
In Beam (Dataflow 2.0.0), I am reading a PubSub topic and then trying to fetch a few rows from Bigtable based on the message from the topic. I couldn't find a way in the Beam documentation to scan Bigtable based on the PubSub messages. I tried writing a ParDo function and piping it into the Beam pipeline, but in vain.

BigtableIO gives an option to read, but that read is configured outside of the pipeline, and I am not sure it would work in a streaming fashion as my use-case requires.

Can anyone please let me know if this is doable, i.e., streaming from PubSub and reading Bigtable based on the message content?

P.S: I am using Java API in Beam 2.0.

    PCollection<String> keyLines =
        pipeline.apply(PubsubIO.readMessagesWithAttributes()
                .fromSubscription("*************"))
            .apply("PubSub Message to Payload as String",
                ParDo.of(new PubSubMessageToStringConverter()));
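For reference, the PubSubMessageToStringConverter used above looks something like this (a sketch that assumes the message payload is UTF-8 text):

    import java.nio.charset.StandardCharsets;

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;

    public class PubSubMessageToStringConverter extends DoFn<PubsubMessage, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Decode the raw payload bytes into a String to use as a row key.
            c.output(new String(c.element().getPayload(), StandardCharsets.UTF_8));
        }
    }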

Now I want the keyLines to act as the row keys to scan Bigtable. I am using the below code snippet from Bigtable. I can see 'RowFilter.newBuilder()' and 'ByteKeyRange', but both of them seem to work in batch mode, not in a streaming fashion.

   pipeline.apply("read",
                BigtableIO.read()
                     .withBigtableOptions(optionsBuilder)
                     .withTableId("**********");

    pipeline.run();

Please advise.


1 Answer


You should be able to read from BigTable in a ParDo. You would have to use Cloud Big Table or HBase API directly. It is better to initialize the client in @Setup method in your DoFn (example). Please post more details if it does not work.