2 votes

I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null), while keeping only the non-null values (to avoid storage costs).

Is there a way to specify in Spark to ignore nulls when writing?

Thanks!

1
@Igor Dvorzhak: Thanks. I want to avoid persisting null values (within a row or a column), not exclude an entire row or column, which is what the link suggests and would mean data loss. - py-r
Thank you for the clarification, I've updated my answer. - Igor Dvorzhak
@IgorDvorzhak: Thanks. So you're suggesting writing Spark data row-by-row into BigTable while applying column pruning each time? No batch way with some value pruning? The hint about Parquet is welcome, but out of scope here since we're discussing BigTable ;) - py-r
@IgorDvorzhak: I've made a few edits to your answer. Let me know if that's not ok. - py-r

1 Answer

2 votes

Probably (I didn't test it), before writing a Spark DataFrame to HBase/BigTable you can transform it by filtering out the columns with null values in each row using a custom function, as suggested in this pandas example: https://stackoverflow.com/a/59641595/3227693. However, to the best of my knowledge there is no built-in connector that supports this feature.
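As a rough illustration (untested), the per-row pruning could look like the PySpark sketch below. It assumes your DataFrame `df` has an `id` column to use as the row key, and `write_row` stands in for whatever BigTable client call you end up using (e.g. via google-cloud-bigtable); both are placeholders, not part of any existing connector.

```python
def to_sparse_dict(row):
    # Keep only the non-null cells of this row.
    return {k: v for k, v in row.asDict().items() if v is not None}

def write_partition(rows):
    # One partition per task: write each pruned row with your own client.
    for row in rows:
        cells = to_sparse_dict(row)
        row_key = cells.pop("id")   # assumed row-key column
        write_row(row_key, cells)   # hypothetical BigTable writer

df.rdd.foreachPartition(write_partition)
```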

Alternatively, you can try storing the data in a columnar file format like Parquet instead, because such formats handle persistence of sparse columnar data efficiently (at least in terms of output size in bytes). But to avoid writing many small files (due to the sparse nature of the data), which can decrease write throughput, you will probably need to decrease the number of output partitions before performing the write (i.e. write more rows per Parquet file: Spark parquet partitioning : Large number of files).
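For example, a minimal sketch of reducing the partition count before the Parquet write; the count of 16 and the output path are placeholders you would tune to your data volume and environment:

```python
(df
 .coalesce(16)                          # assumed partition count; fewer partitions -> fewer, larger files
 .write
 .mode("overwrite")
 .parquet("gs://my-bucket/wide_table"))  # placeholder output path
```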