
I am trying to use the BigQuery export functionality to push data out to GCS in JSON format. At the end of the process, in order to validate the count of exported records in the GCS files, I create an external table with auto schema detection just to take a count of the records in the exported files.

This works for single exported files, but for tables greater than 1 GB in size I use a wildcard URI so that the export is split into multiple files. This results in multiple files, some of which are empty.

The empty files cause an error while querying the external table: "400 Schema has no fields".
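
For reference, a minimal sketch of what I am running, using the google-cloud-bigquery Python client (the project, dataset, and bucket names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; the wildcard makes the export split into multiple files for tables > 1 GB.
source_table = "my-project.my_dataset.my_table"
destination_uri = "gs://my-bucket/exports/my_table-*.json"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(source_table, destination_uri, job_config=job_config).result()

# External table with auto schema detection, used only to count the exported records.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = [destination_uri]
external_config.autodetect = True

check_table = bigquery.Table("my-project.my_dataset.my_table_export_check")
check_table.external_data_configuration = external_config
client.create_table(check_table, exists_ok=True)

# This query is where the "400 Schema has no fields" error shows up when empty files exist.
rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.my_table_export_check").result()
print(next(iter(rows)).n)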

Please suggest any ideas on how to:

  1. Make sure that empty files are not created by the export operation in the multiple-file scenario,
  2. Ignore empty files when creating the external table, or
  3. Take a count of the records in GCS after the export operation in any other way.
It is strange that BQ creates empty files. In my case, it works every time I try with a table larger than 5 GB, and all the files created contain at least 35 MB. How are you doing the export? Have you tried from the UI? Could it be that you already had some files in that bucket that are messing up the new ones? – Alvaro
I am using the Python library to do the export. I have tried the export using the UI as well and the behaviour is the same: it creates multiple empty files along with the split files (each less than 1 GB). – jay12345
It looks like it is existing behaviour for BQ exports to produce some empty files as well. I have found a workaround for my problem: setting the schema for the external table explicitly instead of using auto schema detection (see the sketch below), and then it works even if empty files are present. – jay12345
@jay12345 Can you share your thoughts and a potential workaround addressing the purpose of the question in a valuable answer that can make a best effort for the Stack community? – Nick_Kh
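
Based on the comment above, the explicit-schema workaround could look roughly like this with the google-cloud-bigquery Python client; this is only a sketch, and the schema fields and names below are hypothetical placeholders that have to match the exported table:

from google.cloud import bigquery

client = bigquery.Client()

# Explicit schema instead of auto detection; hypothetical fields that must match the exported table.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-bucket/exports/my_table-*.json"]
external_config.schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("name", "STRING"),
]

check_table = bigquery.Table("my-project.my_dataset.my_table_export_check")
check_table.external_data_configuration = external_config
client.create_table(check_table, exists_ok=True)

# With the schema set explicitly, the count query no longer depends on schema detection.
print(list(client.query("SELECT COUNT(*) FROM my_dataset.my_table_export_check").result())[0][0])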

1 Answer


I had the same problem, but I found a workaround: it seems a TEMP TABLE does the trick.
Imagine we have the following query:

EXPORT DATA OPTIONS(
    uri='gs://mybucket/extract-here/*.csv.gz'
    , format='CSV'
    , compression='GZIP'
    , overwrite=true
    , header=true
    , field_delimiter=","
) AS (
    WITH mytable AS (
        SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
    )
    SELECT * FROM mytable
);

You can rewrite it as follows:

BEGIN
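    -- Materialize the query result into a temp table first, then export from that table.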
    CREATE TEMP TABLE _SESSION.tmpExportTable AS (
        WITH mytable AS (
            SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
        )
        SELECT * FROM mytable
    );
    EXPORT DATA OPTIONS(
        uri='gs://mybucket/extract-here/*.csv.gz'
        , format='CSV'
        , compression='GZIP'
        , overwrite=true
        , header=true
        , field_delimiter=","
    ) AS
    SELECT * FROM _SESSION.tmpExportTable;
END;
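
If you also need the expected record count for validation, adding a SELECT COUNT(*) FROM _SESSION.tmpExportTable; statement before the final END; should return the number of records that were exported, which you can then compare against the files in GCS.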