3 votes

I want to load data with a multi-character delimiter into BigQuery. The BQ load command currently does not support multi-character delimiters; it supports only single-character delimiters such as '|', '$', or '~'.

I know there is a Dataflow approach that reads the data from these files and writes it to BigQuery. But I have a large number of small files (each around 400MB), and each one has to be written to a separate partition of a table (around 700 partitions). This approach is slow with Dataflow because I currently have to start a separate Dataflow job for each file, writing to each partition in a for loop. It has been running for more than 24 hours and is still not complete.

So is there any other approach to load these files, each using a multi-character delimiter, into separate partitions of a BigQuery table?


2 Answers

1 vote

From the Dataflow perspective, you can make this easier by loading multiple files in each pipeline. You can use a for loop in your main method while assembling the pipeline, essentially creating many Read -> Write to BigQuery steps in a single job.
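A minimal sketch of that pattern using the Apache Beam Python SDK. The delimiter `~|~`, the file-to-table mapping, and the two-column schema are assumptions for illustration, not from the question:

```python
# Sketch: one Dataflow job that fans out to many Read -> Parse -> Write
# branches, instead of one job per file. Delimiter, schema, and table
# names are illustrative assumptions.

def parse_row(line, delim="~|~"):
    """Split one raw line on the full multi-character delimiter and map
    the fields to BigQuery column names (the schema is an assumption)."""
    fields = line.split(delim)
    return {"customer_id": fields[0], "amount": fields[1]}

def build_pipeline(file_to_table):
    """Assemble a single pipeline with one branch per input file."""
    import apache_beam as beam  # imported here so the sketch stays optional

    pipeline = beam.Pipeline()
    for path, table in file_to_table.items():
        (
            pipeline
            | f"Read {path}" >> beam.io.ReadFromText(path)
            | f"Parse {path}" >> beam.Map(parse_row)
            | f"Write {path}" >> beam.io.WriteToBigQuery(
                table,
                schema="customer_id:STRING,amount:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
    return pipeline
```

Since each branch can target a different table (or a partition decorator such as `dataset.table$20230101`), one job can cover many of the 700 partitions at once instead of looping over 700 separate jobs.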

See also "Strategy for loading data into BigQuery and Google cloud Storage from local disk" for more information.

0 votes

My lazy approach to these problems: don't parse in Dataflow at all. Load each line raw into BigQuery, into a single-column table (one row per line).

Then you can parse inside BigQuery with a JavaScript UDF, which can split on the full multi-character delimiter.