Impact of having the same value for a column in a huge Hive table with ORC/Parquet file format

Question

What would be the storage and performance implications if multiple columns hold the same value for every row in a huge Hive table whose underlying file format is ORC or Parquet?

Let's say I have a Parquet Hive table in which column 5 and column 8 always have "HELLO" as the value.

How does the file get stored with ORC and Parquet in this scenario?
Does having duplicated column data have any performance impact on queries run later against this table?

Uwe L. Korn · Accepted Answer · 2020-03-26T12:40:57

At least in the case of Parquet files, columns are compressed independently. A column that repeats the same value many times in a row usually compresses very well, but duplicating a column still duplicates the storage it requires.
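To get a feel for both halves of that statement, here is a minimal sketch using only Python's standard `zlib` module. It does not write real Parquet; the row count and the newline-joined byte layout are assumptions made purely for illustration. It compresses two identical constant columns independently and compares sizes.

```python
import zlib

ROWS = 100_000  # hypothetical row count, chosen only for the demo

# Two columns that hold the same constant value in every row.
# Joining with newlines is an illustrative layout, not Parquet's.
column_5 = "\n".join(["HELLO"] * ROWS).encode()
column_8 = "\n".join(["HELLO"] * ROWS).encode()

# Each column is compressed on its own, with no sharing between columns.
compressed_5 = zlib.compress(column_5)
compressed_8 = zlib.compress(column_8)

print("raw bytes per column:       ", len(column_5))
print("compressed bytes per column:", len(compressed_5))
print("total for both columns:     ", len(compressed_5) + len(compressed_8))
```

Each column shrinks to a tiny fraction of its raw size, but because nothing is shared between the two, the second column costs exactly as much again as the first.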

For Parquet the compression scheme is roughly:

1. Per column, split into RowGroups (most often one per file, sometimes more but normally a very small number).
2. Per RowGroup, encode the values (encodings are typically dictionary encoding or run-length encoding).
3. Split the encoded rows roughly on 16KiB/1MiB boundaries named "pages".
4. Compress each page individually with a universal compression codec like GZIP or ZStandard.
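The scheme above can be sketched in plain Python. This is a toy model under stated assumptions, not the actual Parquet layout: `PAGE_VALUES`, the helper names, and the `repr`-based serialization are all hypothetical, pages are cut by value count rather than by byte size, and `zlib` stands in for the GZIP/ZStandard codec.

```python
import zlib

PAGE_VALUES = 20_000  # hypothetical page size, counted in values, not bytes


def dictionary_encode(values):
    """Replace each value with the index of its entry in a dictionary."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return runs


def encode_column_chunk(values):
    """Dictionary-encode, cut into pages, RLE each page, compress each page."""
    dictionary, indices = dictionary_encode(values)
    pages = []
    for start in range(0, len(indices), PAGE_VALUES):
        runs = run_length_encode(indices[start:start + PAGE_VALUES])
        pages.append(zlib.compress(repr(runs).encode()))
    return dictionary, pages


column = ["HELLO"] * 100_000
dictionary, pages = encode_column_chunk(column)

raw_size = sum(len(value) for value in column)
stored_size = sum(len(page) for page in pages) + sum(len(v) for v in dictionary)
print("dictionary:", dictionary, "pages:", len(pages))
print("raw bytes:", raw_size, "stored bytes:", stored_size)
```

For a constant column the dictionary holds a single entry and every page collapses to one run, so the stored size is dominated by per-page overhead rather than by the number of rows.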
