Is it possible to compress a file already saved in Google cloud storage?
The files are created and populated by Google dataflow code. Dataflow cannot write to a compressed file but my requirement is to save it in compressed format.
You could write an app (perhaps using App Engine or Compute Engine) to do this. You would configure notifications on the bucket so your app is notified when a new object is written, and then runs, reads the object, compresses it, and overwrites the object and sets the Content-Encoding metadata field. Because object writes are transactional the compressed form of your object wouldn't become visible until it's complete. Note that if you do this any apps/services that consume the data would need to be able to handle either compressed or uncompressed formats. As an alternative, you could change your dataflow setup so it outputs to a temporary bucket, and set up notifications for that bucket to cause your compression program to run -- and that program then would write the compressed version to your production bucket and delete the uncompressed object.
Writing to compressed files is not supported on the standard TextIO.Sink because reading from compressed files is less scalable -- the file can't be split across multiple workers without first being decompressed.
If you want to do this (and aren't worried about potential scalability limits) you could look at writing a custom file-based sink that compresses the files. You can look at TextIO
for examples and also look at the docs how to write a file-based sink.
The key change from TextIO
would be modifying the TextWriteOperation
(which extends FileWriteOperation
) to support compressed files.
Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.
Another option could be to change your pipeline slightly.
Instead of your pipeline writing directly to GCS, you could write to a table(s) in BigQuery, and then when your pipeline is finished simply kick off a BigQuery export job to GCS with GZIP compression set.
https://cloud.google.com/bigquery/docs/exporting-data https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression