1 vote

In Dataflow 1.x, we could use CloudBigtableIO.writeToTable(TABLE_ID) to create, update, and delete Bigtable rows: as long as a DoFn was configured to output Mutation objects, it could emit either a Put or a Delete, and CloudBigtableIO.writeToTable() would create, update, or delete the row for the given RowID.
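For reference, here is a minimal sketch of that 1.x-era pattern (assuming config is the CloudBigtableTableConfiguration behind my TABLE_ID constant, and rowKeys is a PCollection&lt;String&gt; of keys to delete):

import com.google.cloud.bigtable.dataflow.CloudBigtableIO;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.Bytes;

// Dataflow 1.x style: one DoFn can emit Puts and Deletes as HBase Mutations
// (the pipeline is assumed to be set up with CloudBigtableIO.initializeForWrite)
PCollection<Mutation> mutations = rowKeys.apply(ParDo.of(new DoFn<String, Mutation>() {
  @Override
  public void processElement(ProcessContext c) {
    // An HBase Delete removes the whole row for the given row key
    c.output(new Delete(Bytes.toBytes(c.element())));
  }
}));
mutations.apply(CloudBigtableIO.writeToTable(config));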

It seems that the new Beam 2.2.0 API uses the BigtableIO.write() function, which works with KV<ByteString, Iterable<Mutation>>, where the ByteString is the row key and the Iterable contains a set of row-level operations. I have figured out how to use that for cell-level data, so creating new rows and creating/deleting columns is fine, but how do we delete an entire row now, given an existing RowID?
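For example, this is the kind of cell-level write I can build with the new API (a sketch; the family, qualifier, and row key are placeholders):

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.values.KV;

// A cell-level operation: set a single cell, using the Bigtable v2 protobuf Mutation
Mutation setCell = Mutation.newBuilder()
    .setSetCell(Mutation.SetCell.newBuilder()
        .setFamilyName("cf")
        .setColumnQualifier(ByteString.copyFromUtf8("col"))
        .setValue(ByteString.copyFromUtf8("value")))
    .build();

// BigtableIO.write() consumes one KV per row: the row key plus its mutations
KV<ByteString, Iterable<Mutation>> rowWrite =
    KV.<ByteString, Iterable<Mutation>>of(
        ByteString.copyFromUtf8("row-key"),
        java.util.Collections.singletonList(setCell));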

Any help appreciated!

** Some further clarification:

From this document: https://cloud.google.com/bigtable/docs/dataflow-hbase I understand that changing the dependency ArtifactID from bigtable-hbase-dataflow to bigtable-hbase-beam should be compatible with Beam version 2.2.0, and the article suggests doing Bigtable writes (and hence deletes) the old way, using CloudBigtableIO.writeToTable(). However, that requires imports from the com.google.cloud.bigtable.dataflow family of dependencies, which the Release Notes say is deprecated and shouldn't be used (and indeed it seems incompatible with the new configuration classes, etc.)
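That is, the swap in the pom would look something like this (the version here is only a placeholder; use whichever release the article points to for Beam 2.2.0):

<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-beam</artifactId>
  <version>1.0.0</version> <!-- placeholder version -->
</dependency>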

** Further Update:

It looks like my pom.xml didn't refresh properly after the ArtifactID change from bigtable-hbase-dataflow to bigtable-hbase-beam. Once the project was updated, I was able to import from the com.google.cloud.bigtable.beam.* package, which seems to work, at least for a minimal test.
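For anyone else hitting this, the minimal bit that now compiles for me looks like this (a sketch; the project/instance/table ids are placeholders):

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

// Same configuration style as before, just imported from the .beam package
CloudBigtableTableConfiguration config = new CloudBigtableTableConfiguration.Builder()
    .withProjectId("my-project")
    .withInstanceId("my-instance")
    .withTableId("my-table")
    .build();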

HOWEVER: it looks like there are now two different Mutation classes: com.google.bigtable.v2.Mutation and org.apache.hadoop.hbase.client.Mutation?

And in order to get everything working together, do I have to keep careful track of which Mutation class is used for which operation?

Is there a better way to do this?


2 Answers

1 vote

I would suggest continuing to use CloudBigtableIO with bigtable-hbase-beam. It shouldn't be much different from CloudBigtableIO in bigtable-hbase-dataflow.

CloudBigtableIO uses HBase org.apache.hadoop.hbase.client.Mutation objects and translates them into the equivalent Bigtable values under the covers.
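For example, a whole-row delete with this setup could look something like the following sketch (the project/instance/table ids are placeholders, and rowKeys is assumed to be a PCollection&lt;String&gt; of row keys to delete):

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.Bytes;

CloudBigtableTableConfiguration config = new CloudBigtableTableConfiguration.Builder()
    .withProjectId("my-project")
    .withInstanceId("my-instance")
    .withTableId("my-table")
    .build();

rowKeys
    .apply("ToDeletes", ParDo.of(new DoFn<String, Mutation>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // An HBase Delete removes the entire row, row key included
        c.output(new Delete(Bytes.toBytes(c.element())));
      }
    }))
    .apply(CloudBigtableIO.writeToTable(config));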

3 votes

Unfortunately, Apache Beam 2.2.0 doesn't provide a native interface for deleting an entire row (including the row key) in Bigtable. The only full solution would be to continue using the CloudBigtableIO class, as you already mentioned.

A different solution would be to just delete all the cells from the row. This way, you can fully move forward with using the BigtableIO class. However, this solution does NOT delete the row key itself, so the cost of storing the row key remains. If your application requires deleting many rows, this solution may not be ideal.

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.DeleteFromRow;

// Mutation that deletes all cells from a row (the row key itself remains)
Mutation deleteAllCells = Mutation.newBuilder()
    .setDeleteFromRow(DeleteFromRow.getDefaultInstance())
    .build();
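Wired into a pipeline, that could look something like the following sketch (the ids are placeholders, rowKeys is assumed to be a PCollection&lt;String&gt;, and withBigtableOptions/withTableId are the Beam 2.2.0-era setters):

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.DeleteFromRow;
import com.google.cloud.bigtable.config.BigtableOptions;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

rowKeys
    .apply("ToDeleteFromRow", ParDo.of(
        new DoFn<String, KV<ByteString, Iterable<Mutation>>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Deletes every cell in the row; the row key itself remains
            Mutation deleteAllCells = Mutation.newBuilder()
                .setDeleteFromRow(DeleteFromRow.getDefaultInstance())
                .build();
            c.output(KV.<ByteString, Iterable<Mutation>>of(
                ByteString.copyFromUtf8(c.element()),
                java.util.Collections.singletonList(deleteAllCells)));
          }
        }))
    .apply(BigtableIO.write()
        .withBigtableOptions(new BigtableOptions.Builder()
            .setProjectId("my-project")
            .setInstanceId("my-instance")
            .build())
        .withTableId("my-table"));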