I have CSV (gzip-compressed) files in GCS, and I want to read these files and send the data to BigQuery.
The header info can change (although I know all columns in advance), so just dropping the header is not enough; somehow I need to read the first line and attach the column info to the remaining lines.
How is this possible?
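To illustrate the "attach the column info to the remaining lines" idea, here is a minimal sketch in plain Java (the class and method names are hypothetical, and the naive split(",") ignores quoted commas):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvHeaderJoin {
    // Pair each field of a data line with the column name from the header
    // line, so the column info travels with every record.
    static Map<String, String> toRecord(String headerLine, String dataLine) {
        String[] cols = headerLine.split(",");
        String[] vals = dataLine.split(",", -1); // keep trailing empty fields
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < cols.length; i++) {
            record.put(cols[i], i < vals.length ? vals[i] : "");
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(toRecord("id,name,score", "1,alice,42"));
        // {id=1, name=alice, score=42}
    }
}
```

A map like this can then be converted to a BigQuery TableRow regardless of the column order in that particular file.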
At first I thought I had to implement a custom source, as in this post:
Reading CSV header with Dataflow
But with this solution I'm not sure how to handle the gzip decompression first. Can I somehow use withCompressionType, like TextIO does?
(I found a compression_type parameter in a Python class, but I'm using Java and could not find anything similar in the Java FileBasedSource class.)
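If the decompression has to be done by hand inside a custom source, the standard library already covers the mechanics: wrap the input stream in a java.util.zip.GZIPInputStream before splitting into lines. A self-contained sketch (in real code the InputStream would come from the GCS channel the source opens, not from memory):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFirstLine {
    // Wrap any InputStream in a GZIPInputStream and read the first line;
    // this is essentially what a custom source would have to do itself
    // without TextIO's automatic decompression.
    static String readFirstLine(InputStream compressed) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed),
                        StandardCharsets.UTF_8));
        return reader.readLine();
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = gzip("id,name,score\n1,alice,42\n");
        System.out.println(readFirstLine(new ByteArrayInputStream(file)));
        // id,name,score
    }
}
```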
Also, I feel this is a bit of overkill because it makes the file unsplittable (although in my case that's okay).
Or I could use GoogleCloudStorage to read the file and its first line directly in my main() function, and then proceed to the pipeline.
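Since all column names are known in advance even though their order can change, another option is to recognize header lines inside the pipeline itself (e.g. in a DoFn) instead of reading the file up front. A small sketch of the check, with hypothetical names:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HeaderDetector {
    // A line is a header exactly when every field is one of the known
    // column names; data values are assumed not to collide with them.
    // Naive split(",") that ignores quoted commas.
    static boolean isHeader(String line, Set<String> knownColumns) {
        for (String field : line.split(",")) {
            if (!knownColumns.contains(field.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> cols = new HashSet<>(Arrays.asList("id", "name", "score"));
        System.out.println(isHeader("name,id,score", cols)); // true
        System.out.println(isHeader("1,alice,42", cols));    // false
    }
}
```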
But that is also cumbersome, so I want to confirm whether there is a best practice (the Dataflow way) for reading a CSV file while making use of its header.
Source to extend the behavior of file processing, and I know how to do that, but by doing so I lose the benefit of TextIO's automatic decompression of gz files. – Norio Akagi