I am converting data from CSV to Parquet with Python (Pandas) so that I can later load it into Google BigQuery. Some of my integer columns contain missing values, and since Pandas 0.24.0 I can store them using the nullable Int64 dtype.
Is there a way to keep the Int64 dtype in a Parquet file as well? I can't find a clean solution for integers with missing values (so that they stay INTEGER in BigQuery).
I have also tried importing the CSV directly into BigQuery and got the same error as when converting to Parquet with Pandas (shown below).
I import a CSV with an integer column that includes a missing value:
import pandas as pd
df = pd.read_csv("docs/test_file.csv")
df.info()
id 8 non-null float64
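This upcast is easy to reproduce without the CSV file (the column contents below are hypothetical, just to show the behavior):

```python
# Minimal sketch: a missing value forces an otherwise-integer
# column to float64 when read with pandas' default dtypes.
import io

import pandas as pd

# Hypothetical CSV contents with one empty "id" field
csv = "id,name\n1,a\n,b\n2,c\n"
df = pd.read_csv(io.StringIO(csv))
print(df["id"].dtype)  # float64
```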
The column is imported as float64 because of the missing value. I cast it to the nullable Int64 dtype:
df["id"] = df["id"].astype('Int64')
df.info()
id 8 non-null Int64
Then I try to save it to Parquet:
df.to_parquet("output/test.parquet")
The error:
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column id with type Int64')