3
votes

I'd like to store a tabular dataset in parquet format, using different files for different column groups. Is it possible to partition a parquet file column-wise? If so, can it be done in Python (pyarrow)?

I have a large dataset that collects properties/features (columns) for a number of objects (rows). The rows are on the order of 100k-1M (and will grow over time). The columns are logically separated into 200 groups, with 200-1000 columns per group. The total number of columns is fixed, but the data is acquired sequentially, starting from col group 1, then col group 2, and so on. However, the column names, types and count for a group are not known in advance, before receiving the first data batch for that column group.

The data will be collected over time. I'd like to store this growing set of columns in parquet as the data arrives. Eventually, all the column groups will be filled with data. Over time new objects (rows) will arrive and their data will always start with col group 1 and fill the other groups progressively.

Is it possible (or advisable) to store these data in a single logical parquet file split over multiple files on the file system, where each file contains one column group (200-1000 columns)? Can somebody provide an example of storing such a file using Python/pandas/pyarrow?

Alternatively, each col group could be stored as a separate logical parquet file. In this case, all files would have an object_id index column, but each parquet file (for a col group) would contain a different subset of objects. Any thoughts or suggestions are appreciated.
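To make that alternative concrete, here is a minimal sketch of what I mean (the file names, object_id values and feature columns are invented for illustration):

import pandas as pd

# Hypothetical layout: each column group lives in its own parquet file,
# keyed by an object_id column so the groups can be joined later.
group1 = pd.DataFrame({"object_id": [1, 2, 3],
                       "feat_a": [0.1, 0.2, 0.3]})
group2 = pd.DataFrame({"object_id": [1, 2],  # this group has fewer objects so far
                       "feat_b": ["x", "y"]})

group1.to_parquet("col_group_001.parquet", index=False)
group2.to_parquet("col_group_002.parquet", index=False)

# Reassemble the wide table on demand; objects missing from a group get NaN.
wide = pd.read_parquet("col_group_001.parquet").merge(
    pd.read_parquet("col_group_002.parquet"), on="object_id", how="left")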

1
Can you add an example? – Johnny

1 Answer

2
votes

A parquet dataset has a single schema. Even if there are multiple partitions, each one must have the same schema, which is what enables tools to read these files as if they were one single file.

If the incoming data from the pandas side changes over time, appending to the parquet dataset will not work, because the schema of the new data no longer matches the schema already on disk.

To make your data pipeline work, you need to take care of at least the following (a sketch follows the list):

1. Collect ALL columns, with their datatypes and column order, up front.

2. Format each incoming dataframe to contain ALL columns, with the specified datatypes and column order.

3. Write to parquet.
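As a minimal sketch of those three steps, assuming a fixed master schema is known up front (the column names and types below are invented for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Step 1: the complete schema, collected up front (hypothetical names/types).
master_schema = pa.schema([
    ("object_id", pa.int64()),
    ("feat_a", pa.float64()),
    ("feat_b", pa.string()),
])
master_columns = [field.name for field in master_schema]

def conform(df):
    # Step 2: add any missing columns as nulls and enforce the column order,
    # then convert with the fixed schema so every batch matches exactly.
    df = df.reindex(columns=master_columns)
    return pa.Table.from_pandas(df, schema=master_schema, preserve_index=False)

# Step 3: every write now carries an identical schema, so appending is safe.
batch = pd.DataFrame({"object_id": [1, 2], "feat_a": [0.5, 0.7]})
pq.write_to_dataset(conform(batch), root_path="dataset")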

The code below demonstrates the possible failure modes.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# First table: this write establishes the dataset's schema.
# Note that 'output.parquet' is a directory (the dataset root), not a single file.

df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"NEWCOLUMN":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='output.parquet', partition_cols=['Year', 'Month', 'Day'])
# Read back OK
pandas_df = pd.read_parquet('output.parquet')
print(pandas_df)

# Second table: exactly the same columns, in the same order -> OK
df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"NEWCOLUMN":{"0":1514764800000,"1":1514764800000,"2":1514764800000,"3":1514764800000,"4":1514764800000,"5":1514764800000,"6":1514764800000,"7":1514764800000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='output.parquet', partition_cols=['Year', 'Month', 'Day'])
# Read back OK
pandas_df = pd.read_parquet('output.parquet')
print(pandas_df)

# Third table: the same columns, but in a different order -> fails
df = pd.DataFrame({"NEWCOLUMN":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"},"Date":{"0":1514764800000,"1":1514764800000,"2":1514764800000,"3":1514764800000,"4":1514764800000,"5":1514764800000,"6":1514764800000,"7":1514764800000}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='output.parquet', partition_cols=['Year', 'Month', 'Day'])
# Reading the dataset now fails with a schema mismatch
pandas_df = pd.read_parquet('output.parquet')
print(pandas_df)

# Fourth table: "NEWCOLUMN" left out -> fails
df = pd.DataFrame({"Date":{"0":1514764800000,"1":1514851200000,"2":1514937600000,"3":1515024000000,"4":1515110400000,"5":1515196800000,"6":1515283200000,"7":1515369600000},"Day":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8},"Year":{"0":2018,"1":2018,"2":2018,"3":2018,"4":2018,"5":2018,"6":2018,"7":2018},"Month":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":1,"7":1},"randNumCol":{"0":2,"1":5,"2":4,"3":3,"4":3,"5":5,"6":4,"7":3},"uuid":{"0":"456578af-8953-4cf7-ac27-70309353b72c","1":"df6a30da-619e-4594-a051-4fdb3572eb49","2":"7cfe724a-a827-47b1-a691-c741f4f1101d","3":"f1796ed1-f7ce-4b49-ba64-6aacdca02c0a","4":"827e4aae-1214-4c0f-ac7f-9439e8a577af","5":"08dc3c2b-b75c-4ac6-8a38-0a44007fdeaf","6":"54f4e7bb-6fd8-4913-a2c3-69ebc13dc9a2","7":"eda1dbfe-ad08-4067-b064-bcc689fa0225"}})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='output.parquet', partition_cols=['Year', 'Month', 'Day'])
# Reading the dataset now fails with a schema mismatch
pandas_df = pd.read_parquet('output.parquet')
print(pandas_df)
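For completeness, the failing last case can be repaired in line with the steps above: rebuild the frame against the full column list and restore the dropped column's original integer type before writing. A sketch, continuing from the last df above (the nullable Int64 cast is my assumption about how to keep the arrow type at int64 for an all-null column):

# Hypothetical repair: restore the full column set and order before writing.
all_columns = ['Date', 'Day', 'Year', 'Month', 'randNumCol', 'uuid', 'NEWCOLUMN']
df = df.reindex(columns=all_columns)               # NEWCOLUMN comes back as all-null
df['NEWCOLUMN'] = df['NEWCOLUMN'].astype('Int64')  # nullable int -> arrow int64, matching the earlier files
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='output.parquet', partition_cols=['Year', 'Month', 'Day'])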