Apache Beam to BigQuery in batch: are the intermediary files only generated in JSON?

Question

I'm reading CSV files and transforming them before writing them to BigQuery through Beam (2.1.0) on Cloud Dataflow. The intermediary files generated in GCS for the BigQuery load jobs are JSON files. Is there a way to generate them as CSV rather than JSON, which would consume less space and I/O? And if there is a way to change that, why is the default JSON and not CSV? Best regards,

jkff · Accepted Answer · 2017-09-08T19:48:59

CSV does not support nested or repeated data in the schema, which is why Beam does not use it for BigQuery imports. The JSON and Avro formats do support it, and it may be a good idea to change the implementation to use Avro (we already use Avro for exporting data from BigQuery). Feel free to file a JIRA at https://issues.apache.org/jira/browse/BEAM.
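
To make the point about nested and repeated data concrete, here is a minimal sketch using only Python's standard library (not Beam or BigQuery). The row and its field names are hypothetical; it only illustrates that a JSON line keeps the structure while a CSV cell flattens it to text.

```python
import csv
import io
import json

# Hypothetical row with a nested record ("address") and a repeated field ("tags").
row = {
    "name": "alice",
    "address": {"city": "Paris", "zip": "75001"},
    "tags": ["a", "b"],
}

# A JSON line preserves nesting and repetition, so it round-trips unchanged.
json_line = json.dumps(row)
json_roundtrip = json.loads(json_line)

# CSV cells are flat text: the writer calls str() on the nested values,
# and the reader hands them back as plain strings.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_roundtrip = next(csv.DictReader(buf))

print(json_roundtrip == row)                 # structure preserved
print(type(csv_roundtrip["tags"]).__name__)  # the list came back as a string
```

Flat columns such as `name` survive the CSV round trip, so CSV would only be a safe intermediary format for schemas with no nested or repeated fields.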
