2 votes

Is it possible to compress a file already saved in Google Cloud Storage?

The files are created and populated by Google Cloud Dataflow code. Dataflow cannot write to compressed files, but my requirement is to save the output in a compressed format.


3 Answers

0 votes

You could write an app (perhaps on App Engine or Compute Engine) to do this. You would configure notifications on the bucket so that your app is notified whenever a new object is written; the app then reads the object, compresses it, overwrites it with the compressed bytes, and sets the Content-Encoding metadata field. Because object writes are transactional, the compressed form of your object wouldn't become visible until the write is complete. Note that if you do this, any apps or services that consume the data would need to handle both compressed and uncompressed formats.

As an alternative, you could change your Dataflow setup so it outputs to a temporary bucket, and set up notifications on that bucket to trigger your compression program, which would then write the compressed version to your production bucket and delete the uncompressed object.
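As a rough sketch of the compress-and-overwrite step, assuming the google-cloud-storage Java client and placeholder bucket/object names (in practice those would come from the bucket notification), it could look something like this:

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressInPlace {
    public static void main(String[] args) throws IOException {
        // Placeholder names; a real app would get these from the notification payload.
        String bucket = "my-output-bucket";
        String object = "output-00000-of-00010.txt";

        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of(bucket, object));
        byte[] raw = blob.getContent();

        // Gzip the object contents in memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        }

        // Overwrite the object with the compressed bytes and set Content-Encoding,
        // so clients that honor the header can still read the original text.
        BlobInfo compressed = BlobInfo.newBuilder(BlobId.of(bucket, object))
            .setContentType("text/plain")
            .setContentEncoding("gzip")
            .build();
        storage.create(compressed, buffer.toByteArray());
    }
}
```

Holding the whole object in memory is fine for modestly sized output shards; for very large files you would stream the compression instead.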

3 votes

Writing to compressed files is not supported by the standard TextIO.Sink because reading from compressed files is less scalable: the file can't be split across multiple workers without first being decompressed.

If you want to do this (and aren't worried about the potential scalability limits), you could look at writing a custom file-based sink that compresses the files. You can look at TextIO for examples, and also at the documentation on how to write a file-based sink.

The key change from TextIO would be modifying the TextWriteOperation (which extends FileWriteOperation) to support compressed files.
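Without reproducing a full sink here, the heart of that change is wrapping the WritableByteChannel the writer is given so that bytes are gzip-compressed on the way out. The exact sink and write-operation class names vary by SDK version, so treat the surrounding wiring as an assumption; a minimal sketch of the wrapping itself, using only JDK classes:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.zip.GZIPOutputStream;

public class GzipChannel {
    /**
     * Wraps the raw output channel (such as the one a file-based sink's writer
     * receives for each shard) so that everything written through the returned
     * channel is gzip-compressed.
     */
    public static WritableByteChannel wrap(WritableByteChannel raw) throws IOException {
        OutputStream out = Channels.newOutputStream(raw);
        return Channels.newChannel(new GZIPOutputStream(out));
    }
}
```

In a custom writer you would apply this wrapping at the point where the channel is first handed to you, and make sure the GZIPOutputStream is closed when the shard finishes so the gzip trailer is written.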

Also, consider filing a feature request against Cloud Dataflow and/or Apache Beam.

2 votes

Another option could be to change your pipeline slightly.

Instead of having your pipeline write directly to GCS, you could write to one or more BigQuery tables, and then, once your pipeline has finished, kick off a BigQuery export job to GCS with GZIP compression set.

https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.compression
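As an illustration of that export step, assuming the google-cloud-bigquery Java client and placeholder dataset, table, and bucket names, an extract job with GZIP compression could look roughly like this:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportCompressed {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder table and destination; the wildcard lets BigQuery shard large exports.
        TableId table = TableId.of("my_dataset", "my_table");
        String destinationUri = "gs://my-bucket/export/output-*.csv.gz";

        ExtractJobConfiguration config =
            ExtractJobConfiguration.newBuilder(table, destinationUri)
                .setFormat("CSV")
                .setCompression("GZIP")
                .build();

        // Run the extract job and block until it finishes.
        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor();
        if (job != null && job.getStatus().getError() == null) {
            System.out.println("Export complete");
        }
    }
}
```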