
I have two pyarrow dataset schemas that for some reason are different even though they should be the same (I assume that when one of the Parquet files was stored, a certain column in one partition got cast to a different data type, but I have no idea which column it is).

Now, I know how to check whether two schemas are equal. I can do that like so:

import pandas as pd
import numpy as np
import pyarrow as pa

df1 = pd.DataFrame({'col1': np.zeros(10), 'col2': np.random.rand(10)})
df2 = pd.DataFrame({'col1': np.ones(10), 'col2': np.zeros(10)})

schema_1 = pa.Schema.from_pandas(df1)
schema_2 = pa.Schema.from_pandas(df2)

# Same column names and dtypes, so the schemas are equal
schema_1.equals(schema_2)

# Change one column's dtype to produce a differing schema
df3 = df2.copy()
df3['col2'] = df3['col2'].astype('int')

schema_3 = pa.Schema.from_pandas(df3)
print(schema_1.equals(schema_2), schema_1.equals(schema_3))  # True False

But how do I find out where they differ? (Visual inspection doesn't count; I briefly tried and couldn't spot any difference across over 500 columns.)


1 Answer


Each schema is essentially an ordered collection of pyarrow.Field objects. Two schemas can therefore differ because individual fields differ in name, type, nullability, or metadata, or simply because the fields appear in a different order.
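For example, you can inspect those properties on any field (using the schemas built in the question; Schema.field() accepts a column name in recent pyarrow versions):

f = schema_1.field('col2')         # look a field up by name
print(f.name, f.type, f.nullable)  # the field's name, Arrow type, and nullability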

To find the fields in schema_3 that are not in schema_1, use sets (iterating over a schema yields its fields, which are hashable):

set(schema_3).difference(set(schema_1))
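In the example above, this returns the one field whose type changed, since schema_3 has col2 as int64 while schema_1 has it as double (the exact repr can vary between pyarrow versions):

print(set(schema_3).difference(set(schema_1)))
# e.g. {pyarrow.Field<col2: int64>}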

To find just the names of fields that are different, use the .names property:

set(schema_3.names).difference(set(schema_1.names))
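Note that in the question's scenario this set is empty: the column names match and only a type changed. To pinpoint exactly which columns changed type, you can walk the shared names and compare the fields directly. A minimal sketch, assuming both schemas contain the same column names:

for name in schema_1.names:
    f1 = schema_1.field(name)  # look up each field by column name
    f3 = schema_3.field(name)
    if not f1.equals(f3):
        print(f"{name}: {f1.type} != {f3.type}")
# With the example data this prints: col2: double != int64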