2
votes

I'm trying to use google cloud dataflow to read data from GCS and load to BigQuery tables, however the files in GCS are compressed(gzip), is there any class can be used to read data from compressed/gzipped files?

1
Does this answer your question? Reading from compressed files in DataflowMark Rotteveel

1 Answers

6
votes

Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read from by specifying the compression type:

TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)

However, if the file has a .gz or .bz2 extension, you don't have do do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.