
The Parquet documentation explicitly mentions that the format's design supports splitting the metadata and data into separate files, including the possibility of storing different column groups in different files.

However, I could not find any instructions on how to achieve that. In my use case, I would like to store the metadata in one file, the data for columns 1-100 in another file, and the data for columns 101-200 in a third.

Any idea how to achieve this?


1 Answer


If you are using PySpark, it's as easy as this:

df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')

and it will create a folder called file_name.parquet in the default location in HDFS. You can simply create two DataFrames, one with columns 1-100 and the other with columns 101-200, and save them separately. If by metadata you mean the DataFrame schema, that is saved automatically: Parquet embeds the schema in each file's footer.
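For completeness, here is a minimal, self-contained sketch of the write-then-read round trip (the app name, column names, and sample rows are made up for illustration; any DataFrame behaves the same way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-split-demo").getOrCreate()

# Hypothetical sample data; substitute your own DataFrame here.
df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0)],
    schema=["id", "label", "value"],
)
df.write.mode("overwrite").parquet("file_name.parquet")

# Reading the directory back recovers both the data and the schema,
# since Parquet stores the schema alongside the data.
restored = spark.read.parquet("file_name.parquet")
restored.printSchema()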

You can select a range of columns like this:

df_first_hundred = df.select(df.columns[:100])   # df.columns is an ordered list of names
df_second_hundred = df.select(df.columns[100:])  # the remaining columns

Save them as separate files:

df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')
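If you later need to recombine the two halves, keep in mind that Parquet does not guarantee row order across separate files, so the rows have to be matched up by an explicit key column present in both halves. A rough sketch, assuming both halves contain a hypothetical id column:

df_first_hundred = spark.read.parquet('df_first_hundred')
df_second_hundred = spark.read.parquet('df_second_hundred')

# Join on a shared key (hypothetical "id" column); relying on row
# order across separate Parquet files is not safe.
df_recombined = df_first_hundred.join(df_second_hundred, on='id')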