5
votes

Let's say I have a table with one single field named "version", which is a string. When I try to load data into the table using autodetect with values like "1.1" or "1", the autodetect feature infers these values as float or integer type respectively.

data1.json example:

{ "version": "1.11.0" }

bq load output:

$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data1.json Upload complete. Waiting on bqjob_ZZZ ... (1s) Current status: DONE

data2.json example:

{ "version": "1.11" }

bq load output:

$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json Upload complete. Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to FLOAT

data3.json example:

{ "version": "1" }

bq load output:

$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data3.json Upload complete. Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to INTEGER

The scenario where this problem doesn't happen is when you have, in the same file, another JSON where the value is inferred correctly as string (as seen in Bigquery autoconverting fields in data question):

{ "version": "1.12" }
{ "version": "1.12.0" }

In the question listed above, there's an answer stating that a fix was pushed to production, but it looks like the bug is back again. Is there a way/workaround to prevent this?

1
It looks to me BigQuery is inferring the types correctly? data1.json is a string, data2.json is a float, and data3.json is a integer. If the value for a column is "1.12", then the most accurate type that describes the column is float; if the value for a column can be "1.12" or "1.12.0", then the type of the column can only be string. It'll be nice if BigQuery can use the type of the column in the table to help the auto-detect, but I think it's more like a feature request than a bug. - Hua Zhang
@HuaZhang I don't agree. If the value is enclosed by double quotes, it should be considered a string. In a standard JSON file, booleans, integers and floats don't have double quotes. - Sandro Ribeiro
I see your point. BigQuery has been able to parse strings like "1.12" as double for a long time though, so auto-detect strings like "1.12" as double is by design. Many customers depend on this design, so it's unlikely to change. - Hua Zhang
@HuaZhang gotcha. A parameter to disable this behavior for existing fields would be great. Thanks anyway. - Sandro Ribeiro
@HuaZhang, could you post a comment with the information you provided in the comments? It will help the Community to better understand BigQuery ;) - enle lin

1 Answers

3
votes

Looks like the confusing part here is whether "1.12" should be detected as string or float. BigQuery chose to detect as float. Before autodetect is introduced in BigQuery, BigQuery allows users to load float values in string format. This is very common in CSV/JSON format. So when autodetect is introduced, it kept this behavior. Autodetect will scan up to 100 rows to detect the type. If for all 100 rows, the data is like "1.12", then very likely this field is a float value. If one of the row has value "1.12.0", then BigQuery will detect the type is string, as you have observed.