2
votes

What are the storage and performance implications of having multiple columns that hold the same value in every row of a huge Hive table whose underlying file format is ORC or Parquet?

Let's say I have a Parquet-backed Hive table in which column 5 and column 8 always contain "HELLO".

  1. How is the file stored by ORC and by Parquet in this scenario?
  2. Does the duplicated column data have any performance impact on later queries against this table?

1 Answer

3
votes

At least in the case of Parquet files, columns are compressed independently. A long run of the same value usually compresses very well, but because each column is encoded on its own, a duplicated column also duplicates the required storage, however small it is after compression.

For Parquet the compression scheme is roughly:

  1. Split the rows into RowGroups (most often one per file, sometimes more but normally a very small number); within each RowGroup, every column is stored as its own chunk.
  2. Per RowGroup, encode the values of each column (encodings are typically dictionary encoding or run-length encoding).
  3. Split the encoded values roughly on 16KiB/1MiB boundaries named "pages".
  4. Compress each page individually with a general-purpose compression codec like GZIP or ZStandard.
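
To make the effect on a constant column concrete, here is a minimal sketch of the encode-then-compress idea using only the Python standard library. It is not Parquet's actual on-disk format: the helper names (`dictionary_encode`, `run_length_encode`, `encode_column`), the `repr`-based serialization, and the use of `zlib` as a stand-in for GZIP are all assumptions made for illustration, and RowGroups and pages are ignored.

```python
import zlib


def dictionary_encode(values):
    """Replace each value with the index of its first occurrence."""
    dictionary = {}
    indices = []
    for value in values:
        # setdefault assigns the next free index to unseen values.
        indices.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), indices


def run_length_encode(indices):
    """Collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return runs


def encode_column(values):
    """Toy pipeline: dictionary encode, run-length encode, then compress.

    The serialization via repr() is purely illustrative, not Parquet's layout.
    """
    dictionary, indices = dictionary_encode(values)
    runs = run_length_encode(indices)
    payload = repr((dictionary, runs)).encode("utf-8")
    return zlib.compress(payload)


n_rows = 100_000
column_5 = ["HELLO"] * n_rows
column_8 = ["HELLO"] * n_rows

# Each column is encoded on its own, as in the scheme described above.
encoded_5 = encode_column(column_5)
encoded_8 = encode_column(column_8)

raw_bytes = n_rows * len("HELLO")
print("raw bytes per column:    ", raw_bytes)
print("encoded bytes, column 5: ", len(encoded_5))
print("encoded bytes, both cols:", len(encoded_5) + len(encoded_8))
```

In this toy model each constant column shrinks to a few dozen bytes, and the second column costs exactly as much again because nothing is shared between columns. Real Parquet files add per-page and per-column metadata on top, so the absolute numbers will differ.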