I would try to prepare dataset for processing without null
s. In case of few columns this simple iterative approach might work fine (don't forget to import spark implicits before import spark.implicits._
):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If null
s are inevitable, use untyped dataset (aka dataframe):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val row = df.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()