UPDATE: the recently released org.apache.beam.sdk.io.hbase-2.6.0 includes a HBaseIO.readAll() API. I tested it on Google Cloud Dataflow, and it seems to work. Are there any issues or pitfalls with using HBaseIO directly in a Google Cloud Dataflow setting?
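For concreteness, here is roughly what I tried, based on my reading of the 2.6.0 javadoc (a sketch only: the table name and prefixes are made-up placeholders, and it assumes an HBase/Bigtable endpoint reachable through the Configuration):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: "my-table" and the prefixes are placeholders; the cluster
// connection details would come from the Hadoop Configuration.
Pipeline p = Pipeline.create();
Configuration conf = HBaseConfiguration.create();

// Build one Read per prefix scan.
List<HBaseIO.Read> reads = new ArrayList<>();
for (String prefix : new String[] {"a#", "b#"}) {
  reads.add(HBaseIO.read()
      .withConfiguration(conf)
      .withTableId("my-table")
      .withFilter(new PrefixFilter(Bytes.toBytes(prefix))));
}

// Unlike HBaseIO.read(), readAll() consumes a PCollection of Read
// operations rather than PBegin, and expands each into its Results.
PCollection<Result> results = p.apply(Create.of(reads)).apply(HBaseIO.readAll());
```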
BigtableIO.read takes PBegin as its input. I am wondering whether there is anything like SpannerIO's readAll API, where BigtableIO's read could instead take a PCollection of read operations (e.g., Scans) as input and produce a PCollection<Result> from those read operations.
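For reference, the SpannerIO pattern I have in mind looks roughly like this (a sketch based on my reading of the SpannerIO javadoc; the instance, database, table, and column names are made up):

```java
import com.google.cloud.spanner.KeySet;
import com.google.cloud.spanner.Struct;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.spanner.ReadOperation;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// Sketch only: "my-instance", "my-db", "users" are placeholders.
Pipeline p = Pipeline.create();
SpannerConfig config = SpannerConfig.create()
    .withInstanceId("my-instance")
    .withDatabaseId("my-db");

// A PCollection of read operations (in practice, one per key range)...
PCollection<ReadOperation> reads = p.apply(Create.of(
    ReadOperation.create()
        .withTable("users")
        .withColumns("id", "name")
        .withKeySet(KeySet.all())));

// ...expanded into a single PCollection of rows.
PCollection<Struct> rows =
    reads.apply(SpannerIO.readAll().withSpannerConfig(config));
```

This is the shape I would like for Bigtable: the read operations themselves arrive as a PCollection rather than being fixed at pipeline construction time.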
I have a use case where I need to run multiple prefix scans, each with a different prefix, and the number of rows sharing a prefix can be small (a few hundred) or large (a few hundred thousand). If nothing like ReadAll is available yet, I am thinking of having a DoFn run a 'limit' scan, and if the limit scan does not reach the end of the key range, split the remainder into smaller chunks. In my case the key space is uniformly distributed, so the number of remaining rows can be estimated well from the last scanned row (assuming all keys smaller than the last scanned key have been returned by the scan).
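The estimation step I have in mind can be sketched like this (my own code, assuming fixed-width big-endian row keys interpreted as unsigned integers, so that a uniformly distributed key space makes row count proportional to the fraction of the key range covered):

```java
import java.math.BigInteger;

public class ScanEstimator {

    // Interpret a fixed-width big-endian row key as an unsigned integer.
    static BigInteger keyToInt(byte[] key) {
        return new BigInteger(1, key);
    }

    /**
     * After a limit scan over [startKey, endKey) returned scannedRows rows
     * ending at lastKey, estimate the total rows in the range, assuming keys
     * are uniformly distributed (row count proportional to key-space fraction).
     */
    static long estimateTotalRows(byte[] startKey, byte[] endKey,
                                  byte[] lastKey, long scannedRows) {
        BigInteger start = keyToInt(startKey);
        BigInteger covered = keyToInt(lastKey).subtract(start);
        BigInteger whole = keyToInt(endKey).subtract(start);
        if (covered.signum() <= 0) {
            return scannedRows; // degenerate range; nothing to extrapolate
        }
        // scannedRows * whole / covered, computed without overflow
        return BigInteger.valueOf(scannedRows)
                .multiply(whole).divide(covered).longValue();
    }

    public static void main(String[] args) {
        byte[] start = {0x00, 0x00};
        byte[] end   = {(byte) 0x10, 0x00}; // range spans 0x1000 key values
        byte[] last  = {0x04, 0x00};        // limit scan covered one quarter
        // 250 rows in a quarter of the range extrapolates to ~1000 total.
        System.out.println(estimateTotalRows(start, end, last, 250)); // → 1000
    }
}
```

With that estimate, a DoFn could decide whether the remaining range [lastKey, endKey) is small enough to scan directly or should be split into further chunks.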
Apologies if similar questions have been asked before.