0
votes

I can see in the Java SDK documentation that we can specify compression in the FileIO.ReadableFile utility class - https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/FileIO.ReadableFile.html#open--

However, I am using Python, where compression is exposed as an argument (apache_beam.io.fileio.ReadMatches(compression=None, skip_directories=True)), but from skimming the source code I don't think it does anything - https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.io.fileio.html#apache_beam.io.fileio.ReadMatches

Can somebody confirm if I can open bz2 files with this class?

I specifically need this class so that I can use the file metadata (metadata.path for the filename). If anybody has creative ideas on how I can add the filename to each of my rows some other way, for example as a side input, please share those too.

2
Starting with 2.18.0, compressed files will be supported. - Pablo

2 Answers

2
votes

Not yet possible (as @Pablo says), but if you want to start now, you can begin with the decompressorBulkTemplate of Dataflow. There are a lot of lines, but the code is not hard to understand.

Instead of writing the decompressed output, process your file right after the decompression. It's a good starting point if you want to begin today.
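
As a rough illustration of "decompress, then process" in Python, here is a minimal sketch that uses only the standard-library bz2 module. It assumes you already have the file path (for example metadata.path) and the raw, still-compressed bytes of the file; the function name rows_with_filename and the sample path are hypothetical, and wiring this into a Beam transform (for example a FlatMap after ReadMatches) is left out because I have not verified how ReadMatches hands back bytes in your SDK version.

```python
import bz2
from typing import Iterator, Tuple


def rows_with_filename(
    path: str, compressed: bytes, encoding: str = "utf-8"
) -> Iterator[Tuple[str, str]]:
    """Yield (path, line) pairs from the bz2-compressed contents of one file.

    `path` and `compressed` are assumed to come from whatever matched and
    read the file; this helper is hypothetical and not part of Apache Beam.
    """
    # Decompress the whole payload in memory, then decode it to text.
    text = bz2.decompress(compressed).decode(encoding)
    for line in text.splitlines():
        if line:  # skip empty lines
            yield path, line


if __name__ == "__main__":
    # Build a small compressed payload in memory to stand in for a real file.
    sample = bz2.compress(b"a,1\nb,2\n")
    for filename, row in rows_with_filename("example/data.csv.bz2", sample):
        print(filename, row)
```

This decompresses the whole file in memory, so it only suits files that fit comfortably in a worker's RAM; for larger files, a streaming reader such as bz2.open on a file-like object would be the safer choice.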