
I have the following infrastructure in place: Dataflow is used to send messages from AWS SQS to Google Cloud Pub/Sub. Messages are read with Java and Apache Beam (SqsIO).

Is there a way with Dataflow to delete the messages in AWS SQS once they arrive in / are read into Pub/Sub, and what would that look like? Can this be done in Java with Apache Beam?

Thank you for any answers in advance!

1 Answer

There's no built-in support for message deletion, but you can add code that deletes messages read from AWS SQS using a Beam ParDo. However, you must perform such a deletion with care.

A Beam runner performs reading using one or more workers. A given work item could fail at any time, and a runner usually re-runs a failed work item. Additionally, most runners fuse multiple steps: for example, if you have a Read transform followed by a delete ParDo, a runner may fuse these transforms and execute them together. Now, if a work item fails after partially deleting data, a re-run of that work item may fail or may produce incorrect data.

The usual solution is to add a fusion break between the two steps. You can achieve this with Beam's Reshuffle.viaRandomKey() transform (or by adding any transform that uses GroupByKey). For example, your pipeline could be structured as follows.

pipeline
    .apply(SqsIO.read())
    .apply(Reshuffle.viaRandomKey())
    .apply(ParDo.of(new DeleteSQSDoFn()))
    .apply(BigQueryIO.write(...))
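
The DeleteSQSDoFn above is not something Beam provides; you would write it yourself. A minimal sketch (assuming the AWS SDK v1 client that the older SqsIO connector is built on, with the queue URL passed in as a constructor parameter) could look like this:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn that deletes each SQS message after the fusion break,
// then passes the message downstream unchanged.
public class DeleteSQSDoFn extends DoFn<Message, Message> {

  private final String queueUrl;
  private transient AmazonSQS sqs; // not serializable, so built per worker

  public DeleteSQSDoFn(String queueUrl) {
    this.queueUrl = queueUrl;
  }

  @Setup
  public void setup() {
    // Create the client once per DoFn instance rather than per element.
    sqs = AmazonSQSClientBuilder.defaultClient();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Message message = c.element();
    // The receipt handle from the read identifies this delivery of the message.
    sqs.deleteMessage(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));
    c.output(message);
  }
}
```

One caveat: an SQS receipt handle is only valid while the message's visibility timeout has not expired, so if the delete step runs long after the read, the deletion may fail. Keep the queue's visibility timeout comfortably larger than your pipeline's expected read-to-delete latency.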