2 votes

I want to persist to BigTable a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null), while keeping only the non-null values (to avoid storage costs).

Is there a way to specify in Spark to ignore nulls when writing?

Thanks!

1
@Igor Dvorzhak: Thanks. I want to avoid persisting null values (within a row or a column), not exclude an entire row or column, which is what the link suggests and would mean data loss. - py-r
Thank you for the clarification, I've updated my answer. - Igor Dvorzhak
@IgorDvorzhak: Thanks. So you're suggesting writing Spark data row-by-row into BigTable while applying column pruning each time? No batch way with some value pruning? The hint about Parquet is welcome, but out of scope here since we're discussing BigTable ;) - py-r
@IgorDvorzhak: I've made a few edits to your answer. Let me know if that's not ok. - py-r

1 Answer

2 votes

Probably (I didn't test it), before writing a Spark DataFrame to HBase/BigTable you can transform it by filtering out the columns with null values in each row using a custom function, as suggested in this pandas example: https://stackoverflow.com/a/59641595/3227693. However, to the best of my knowledge there is no built-in connector that supports this feature.
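As a rough illustration (untested), the per-row pruning could look like the PySpark sketch below. It assumes your DataFrame `df` has an `id` column to use as the row key, and `write_row` stands in for whatever BigTable client call you end up using (e.g. via google-cloud-bigtable); both are placeholders, not part of any existing connector.

```python
def to_sparse_dict(row):
    # Keep only the non-null cells of this row.
    return {k: v for k, v in row.asDict().items() if v is not None}

def write_partition(rows):
    # One partition per task: write each pruned row with your own client.
    for row in rows:
        cells = to_sparse_dict(row)
        row_key = cells.pop("id")   # assumed row-key column
        write_row(row_key, cells)   # hypothetical BigTable writer

df.rdd.foreachPartition(write_partition)
```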

Alternatively, you can try storing the data in a columnar file format like Parquet instead, because such formats handle persistence of sparse columnar data efficiently (at least in terms of output size in bytes). But to avoid writing many small files (due to the sparse nature of the data), which can decrease write throughput, you will probably need to decrease the number of output partitions before performing the write (i.e. write more rows per Parquet file: Spark parquet partitioning : Large number of files).
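For example, a minimal sketch of reducing the partition count before the Parquet write; the count of 16 and the output path are placeholders you would tune to your data volume and environment:

```python
(df
 .coalesce(16)                          # assumed partition count; fewer partitions -> fewer, larger files
 .write
 .mode("overwrite")
 .parquet("gs://my-bucket/wide_table"))  # placeholder output path
```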