5 votes

I need to transfer a large table in BigQuery, about 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.

Because of the size of the table, I need to specify a URI that includes a * wildcard so the export is sharded. I end up with 400 CSV files in Cloud Storage, and each one has a header row.

This makes combining the files time-consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined CSV file is about 48GB.

Is there a better approach for this?

2
What do you do with these CSV files after combining them? Why do you need to combine them before uploading (why can't you upload them separately)? Do you really need the headers, or can your code assume the column order? – Ashish

2 Answers

4 votes

Using the API, you can tell BigQuery not to print the header row during the table extraction by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
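As a rough sketch, here is how that might look with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Ask for CSV output and suppress the header row in every shard.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    print_header=False,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",      # placeholder table
    "gs://my-bucket/export/shard-*.csv",   # wildcard URI for the sharded export
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes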

Once you've done this, concatenating the files is much easier. On a Linux/Mac machine it would be a single cat command. However, you could also try to concatenate the files directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either through the API or with the command-line utility.

Since a compose operation is limited to 32 source components, you will have to compose the files in batches of 32. That makes about 13 compose operations for 400 files, plus one final compose to merge the intermediate objects (see the sketch below). Note that I have never tried the compose operation, so I'm just guessing on this part.
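If it helps, here is a rough Python sketch of that batched composition using the google-cloud-storage client. The bucket name, object prefix, and output name are placeholders, and it assumes the shards were exported without headers as described above:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket

# List the exported shards; sorting keeps them in shard order.
shards = sorted(client.list_blobs("my-bucket", prefix="export/shard-"),
                key=lambda b: b.name)

# First pass: compose in batches of 32, the per-request limit.
intermediates = []
for i in range(0, len(shards), 32):
    part = bucket.blob(f"export/intermediate-{i // 32:03d}.csv")
    part.compose(shards[i:i + 32])
    intermediates.append(part)

# Second pass: merge the intermediates (13 of them for 400 shards)
# into the final combined object.
combined = bucket.blob("export/combined.csv")
combined.compose(intermediates)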

0 votes

From the command line, the bq utility can skip the header rows for you via the --skip_leading_rows flag. Note that it is a load option, so it applies when you load the CSV files into a table rather than when you extract them (the dataset, table, and bucket names below are placeholders):

bq load --source_format=CSV --skip_leading_rows=1 my_dataset.my_table gs://my-bucket/export/shard-*.csv