1
votes

I have some files uploaded to Google Cloud Storage (csv and json).

I could create BigQuery tables, native or external, linking to these files in Google Cloud Storage.

In the process of creating bigquery tables, I could check "Schema Automatically detect".

The "Schema Automatically detect" works well with json new line delimited format file. But with the csv file, first row is the 'column name", bigquery cannot do the "schema automatically detect", it treats the first line as data, and then the schema bigquery created will be string_field_1, string_field_2 etc.

Are there anything that I need to do for my csv file that makes bigquery "Schema Automatically detect" works?

The csv file I have is "Microsoft Excel Comma Separated Value File".

Update:

If first column is empty, BigQuery autodetect doesn't detect headers

custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
,A123,,,Medley of very old Viennese songs,,,,,,,
,A234,,,Suite de pièces No. 3 en Ré Mineur  HWV 428 - Allemande,,,,,,,

But if first column is not empty - it is OK:

custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
1,A123,,,Medley of very old Viennese songs,,,,,,,
2,A234,,,Suite de pièces No. 3 en Ré Mineur  HWV 428 - Allemande,,,,,,,

Should it be a feature improvement request for BigQuery?

2

2 Answers

2
votes

CSV autodetect does detect header line in CSV files, so there must be something special about your data. It would be good if you can provide the real data snippet and the actual commands you used. Here is my example that demonstrates how it works:

~$ cat > /tmp/people.csv
Id,Name,DOB
1,Bill Gates,1955-10-28
2,Larry Page,1973-03-26
3,Mark Zuckerberg,1984-05-14
~$ bq load --source_format=CSV --autodetect dataset.people /tmp/people.csv
Upload complete.
Waiting on bqjob_r33dc9ca5653c4312_0000015af95f6209_1 ... (2s) Current status: DONE   
~$ bq show dataset.people
Table project:dataset.people

   Last modified        Schema        Total Rows   Total Bytes   Expiration   Labels  
 ----------------- ----------------- ------------ ------------- ------------ -------- 
  22 Mar 21:14:27   |- Id: integer    3            89                                 
                    |- Name: string                                                   
                    |- DOB: date                            
0
votes
custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
,A123,,,Medley of very old Viennese songs,,,,,,,
,A234,,,Suite de pièces No. 3 en Ré Mineur  HWV 428 - Allemande,,,,,,,

If the first column is empty, Google BigQuery cannot detect the schema.

custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
1,A123,,,Medley of very old Viennese songs,,,,,,,
2,A234,,,Suite de pièces No. 3 en Ré Mineur  HWV 428 - Allemande,,,,,,,

If I add the value to the first column, then Google BigQuery can detect the schema.

Should it be a feature improvement request for BigQuery?