1
votes

I have a job that writes records to a remote file system using StreamingFileSink. I would like to build an index of which part file contains each record. Is there a way to achieve this using the Flink APIs? Or is there a way to know that a part file is finished, so that I can process it offline for indexing?

1 Answer

0
votes

In the daily build of the Flink documentation, the part file lifecycle of the StreamingFileSink is described in detail.

The short answer is that the part files are renamed when they can safely be consumed -- "safely" meaning that the file has been closed (no further writes will occur) and checkpointed. At this point the filename will change from part-subtaskIndex-partFileIndex.inprogress.uid to part-subtaskIndex-partFileIndex. For example, the name might change from part-1-0.inprogress.ea65a428-a1d0-4a0b-bbc5-7a436a75e575 to part-1-0.
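
For offline indexing, one option is to list the output directory and pick up only the files whose names have the finished form. The following is a minimal sketch of that filter, assuming the default naming described above (no custom prefix or suffix); the class and method names are hypothetical and not part of Flink.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PartFileNames {
    // Finished files look like part-<subtaskIndex>-<partFileIndex>, with no
    // ".inprogress.<uid>" suffix. This assumes the default naming scheme.
    private static final Pattern FINISHED = Pattern.compile("part-(\\d+)-(\\d+)");

    /** Returns true if the name matches the finished part file form exactly. */
    public static boolean isFinished(String fileName) {
        return FINISHED.matcher(fileName).matches();
    }

    /** Returns the subtask index of a finished part file, or -1 if the name does not match. */
    public static int subtaskIndex(String fileName) {
        Matcher m = FINISHED.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Keeps only the names of finished part files. */
    public static List<String> finishedOnly(List<String> fileNames) {
        return fileNames.stream()
                .filter(PartFileNames::isFinished)
                .collect(Collectors.toList());
    }
}
```

An indexing job could call finishedOnly on each directory listing and skip anything it has already processed; in-progress files are ignored until they are renamed.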

Note that having checkpointing enabled is required for correct operation of the StreamingFileSink.

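Coming back to the first part of your question -- can you determine which part file contains a given record, using the public API? Only partly, I believe. If the stream is keyed immediately before the sink, KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism) returns the index of the subtask that handles a given key, and that index is the subtaskIndex in the part file name. It does not tell you the partFileIndex, which depends on when the file was rolled.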
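
As far as I know, assignKeyToParallelOperator works in two steps: it hashes the key into a key group in the range [0, maxParallelism), and then maps that key group to a subtask index. The sketch below reproduces only the second step, to show how key groups are spread over subtasks; the hashing step is left out, and the class and method names are hypothetical rather than Flink's own.

```java
public class KeyGroupSketch {
    /**
     * Maps a key group to a subtask index using integer arithmetic:
     * keyGroup * parallelism / maxParallelism.
     * Assumption: this mirrors the second step of Flink's key assignment.
     */
    public static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        if (keyGroup < 0 || keyGroup >= maxParallelism) {
            throw new IllegalArgumentException("keyGroup must be in [0, maxParallelism)");
        }
        return keyGroup * parallelism / maxParallelism;
    }
}
```

With maxParallelism 128 and parallelism 4, key groups 0 to 31 go to subtask 0, 32 to 63 to subtask 1, and so on. For real lookups, call the Flink method itself so that the hashing step matches what the job does.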

For an explanation of how keyed state is organized, see A Deep Dive into Rescalable State in Apache Flink.