6
votes

I tried to load CSV files into a BigQuery table. There are columns whose type is INTEGER, but some values are missing and appear as null. When I use the bq load command to load the files, I get the following error:

Could not parse 'null' as int for field

So I am wondering: what is the best way to deal with this? Do I have to reprocess the data first for bq to load it?

3
To confirm... it looks like you have values that are the string 'null' as opposed to actually being null. Is that right? – Elliott Brossard
@ElliottBrossard That is right, so in the CSV file it is something like ,null, instead of ,,. – Charles Zhan

3 Answers

7
votes

You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, declare the column as type STRING, and load the CSV file into a table that you don't plan to keep long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the name you actually want. Now run the query:

#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;

This converts the string values to integers; SAFE_CAST returns NULL for any value, such as the string 'null', that cannot be parsed as an INT64.
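Alternatively, if you'd rather reprocess the file before loading, as the question suggests, here is a minimal pure-Python sketch (the helper name and column values are made up): it rewrites the CSV so that literal 'null' tokens become genuinely empty fields, which bq load accepts for nullable INTEGER columns.

```python
import csv
import io

def blank_null_tokens(src, dst):
    """Copy CSV from src to dst, turning 'null' cells into empty cells."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow('' if cell == 'null' else cell for cell in row)
```

A row like `1,null,3` becomes `1,,3`, which loads cleanly into a nullable INTEGER column.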

2
votes

Try setting the null marker in the job configuration. Note that the marker must match the literal text in the file, which here is lowercase:

job_config.null_marker = 'null'

configuration.load.nullMarker
string
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.

https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
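For completeness, a sketch of how that setting fits into a full load job with the google-cloud-bigquery Python client (this needs GCP credentials to actually run; the dataset, table, and file names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1          # assumes the first row is a header
job_config.null_marker = 'null'           # treat the literal string 'null' as NULL
job_config.autodetect = True              # or set job_config.schema explicitly

# 'my_dataset.my_table' and 'data.csv' are placeholder names.
with open('data.csv', 'rb') as f:
    load_job = client.load_table_from_file(
        f, 'my_dataset.my_table', job_config=job_config)
load_job.result()  # wait for the load to finish
```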

2
votes

The BigQuery Console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, this can easily be done with the command-line tool's bq load command, using the --null_marker flag to specify the marker, which is simply null in this case.

bq load --source_format=CSV \
    --null_marker=null \
    --skip_leading_rows=1 \
    dataset.table_name \
    ./data.csv \
    ./schema.json

Setting null_marker to null does the trick here. You can omit the schema.json part if the table already exists with a valid schema. --skip_leading_rows=1 is used because my first row was a header. You can learn more about the bq load command in the BigQuery documentation.

The load command also lets you create and load a table in a single go; in that case the schema needs to be specified in a JSON file in the following format:

[
 {
   "description": "[DESCRIPTION]",
   "name": "[NAME]",
   "type": "[TYPE]",
   "mode": "[MODE]"
 },
 {
   "description": "[DESCRIPTION]",
   "name": "[NAME]",
   "type": "[TYPE]",
   "mode": "[MODE]"
 }
]
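For example, a schema matching the question's scenario (the column names and descriptions here are made up) might look like:

```json
[
  {
    "description": "Row identifier",
    "name": "id",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "description": "Observed value; may be missing",
    "name": "score",
    "type": "INTEGER",
    "mode": "NULLABLE"
  }
]
```

Columns that can contain nulls should use mode NULLABLE so that empty (or null-marked) cells load without error.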