0 votes

Using pyarrow, I have a Parquet Dataset composed of multiple parquet files. If the columns differ between the files, I get a "ValueError: Schema in was different".

Is there a way to avoid this? Meaning, I'd like to have a Dataset composed of files which each contain different columns.

I guess pyarrow could handle this by filling in the values of the missing columns as NA when the columns are not present in a particular component file of the Dataset.

Thanks

I wonder why this question has 2 downvotes; it's quite clear, and a definitely non-obvious issue to solve with pyarrow. Did you manage to solve your problem? I am trying to solve a similar issue - accessing HDFS and reading data from parquet files, where some files have a different schema than others. – ira
It's by design: all files in a Dataset must have the same schema. There are efficient ways to read the schema of a file, so you can avoid blowing up. – chris
It happens quite often that the schema evolves over time, as one day there might be a decision to add a column to the dataset without changing the past parquet files. Ideally, in such a case, I would like to get missing values in the historical data when I try to read both old and new parquet files at once. But I wasn't able to figure it out yet. I suppose pyarrow might add this functionality to the 'non-legacy' dataset in the future... or do you know how to handle this scenario? – ira

1 Answer

-1 votes

Load the files into separate dataframes, such as df1 and df2, then merge those dataframes by referencing THIS article.

In the article you may find two ways to merge; one is

df1.merge(df2, how = 'outer')

and the other, also from the pandas package, is:

pd.concat([df1, df2])
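As a minimal sketch of the concat approach - assuming pandas is available, and using in-memory frames in place of the parquet files - `pd.concat` aligns on column names and fills any column missing from one frame with NaN, which is exactly the behavior the question asks for:

```python
import pandas as pd

# Frames standing in for two parquet files with different columns
# (in practice, each would come from pd.read_parquet on one file).
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3], "b": ["x"]})

# Rows from df1 get NaN in the column "b" they never had.
combined = pd.concat([df1, df2], ignore_index=True)
```

Note that `df1.merge(df2, how='outer')` joins on the shared columns instead of simply stacking rows, so for the "same rows, evolving columns over time" scenario, `pd.concat` is usually the closer fit.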