2 votes

Given a dataframe, say one with 4 columns and 3 rows, I want to write a function that returns the columns in which all values equal 1.

This is Scala code. I want to use Spark transformations to transform or filter the input dataframe, and the filter should be implemented in a function.

case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)

val example = Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, null, 1),
  Grade(1, 10, 2, 1)
)

val dfInput = spark.createDataFrame(example)

After I call the function filterColumns()

val dfOutput = dfInput.filterColumns()

it should return a 3-row, 2-column dataframe in which every value is 1.
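Since DataFrame has no built-in filterColumns, what I have in mind is exposing it as an extension method, roughly like the sketch below (FilterColumnsOps is just a placeholder name; the body is the part I am asking about):

import org.apache.spark.sql.DataFrame

// Placeholder wrapper; implementing filterColumns() is the question
implicit class FilterColumnsOps(df: DataFrame) {
  def filterColumns(): DataFrame = {
    ??? // keep only the columns where every value equals 1
  }
}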


3 Answers

1 vote

One option is a reduce over the underlying RDD:

import org.apache.spark.sql.Row
import spark.implicits._

val df = Seq(("1","A","3","4"), ("1","2","?","4"), ("1","2","3","4")).toDF()
df.show()

val diffStr = "#"    // marker: the column differs somewhere or is not the target
val targetStr = "1"  // the value every cell of a kept column must equal

// Copy a Row of strings into an Array[String]
def rowToArray(row: Row): Array[String] = {
  val arr = new Array[String](row.length)
  for (i <- 0 until row.length) {
    arr(i) = row.getString(i)
  }
  arr
}

// Element-wise merge of two rows: keep a value only when both rows agree on it
// and it equals the target; otherwise mark that position with diffStr
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
  val arr = new Array[String](a1.length)
  for (i <- 0 until a1.length) {
    arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
  }
  arr
}

// After reducing all rows, targetStr survives only at positions where every row is "1"
val diff = df.rdd
  .map(rowToArray)
  .reduce(compareArrays)

// Keep only the columns that were not marked as differing
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s => df(s._1))

df.select(cols: _*).show()
The first show() prints the input:

+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  A|  3|  4|
|  1|  2|  ?|  4|
|  1|  2|  3|  4|
+---+---+---+---+

and the final select() keeps only the all-"1" column:

+---+
| _1|
+---+
|  1|
|  1|
|  1|
+---+
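Note that this example sidesteps the question's nulls by using "?" placeholders. With real nulls, a1(i).equals(...) would throw a NullPointerException; a null-safe sketch of compareArrays (relying on Scala's null-safe ==) could look like this:

// Null-safe variant: treat null as "not the target value"
def compareArraysNullSafe(a1: Array[String], a2: Array[String]): Array[String] =
  a1.zip(a2).map { case (x, y) =>
    if (x != null && x == y && x == targetStr) x else diffStr
  }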
1 vote

I would try to prepare the dataset for processing without nulls. With few columns, this simple iterative approach might work fine (don't forget to import the Spark implicits first: import spark.implicits._):

import org.apache.spark.sql.Dataset
import spark.implicits._

val example = spark.sparkContext.parallelize(Seq(
    Grade(1, 3, 1, 1),
    Grade(1, 1, 0, 1),
    Grade(1, 10, 2, 1)
)).toDS().cache()

// A column qualifies when its only distinct value is 1
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
    val rows = ds.select(colName).distinct().collect()
    rows.length == 1 && rows.head.getInt(0) == 1
}

val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()

The result is:

+---+---+
| c1| c4|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+

If nulls are unavoidable, use an untyped dataset (i.e. a plain dataframe):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
    StructField("c1", IntegerType, nullable = true),
    StructField("c2", IntegerType, nullable = true),
    StructField("c3", IntegerType, nullable = true),
    StructField("c4", IntegerType, nullable = true)
))

val example = spark.sparkContext.parallelize(Seq(
    Row(1, 3, 1, 1),
    Row(1, 1, null, 1),
    Row(1, 10, 2, 1)
))

val dfInput = spark.createDataFrame(example, schema).cache()

// As above, but also guard against a column whose single distinct value is null
def allOnes(colName: String, df: DataFrame): Boolean = {
    val rows = df.select(colName).distinct().collect()
    rows.length == 1 && !rows.head.isNullAt(0) && rows.head.getInt(0) == 1
}

val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()
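Both allOnes variants launch a separate distinct() job per column. A possible single-pass alternative (just a sketch using the standard org.apache.spark.sql.functions API): compute, for all columns at once, a flag saying whether any value is null or differs from 1, then keep the unflagged columns:

import org.apache.spark.sql.functions.{col, lit, max, when}

// One aggregation pass: per column, 1 if any value is null or != 1, else 0
val flags = dfInput.select(dfInput.columns.map { c =>
  max(when(col(c).isNull || col(c) =!= 1, lit(1)).otherwise(lit(0))).as(c)
}: _*).first()

val allOneCols = dfInput.columns.filter(c => flags.getAs[Int](c) == 0)
dfInput.select(allOneCols.map(col): _*).show()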
1 vote

A somewhat more readable approach, using a typed Dataset[Grade]:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import spark.implicits._

// Replace every non-1 value with null, then keep the columns that have no nulls
val tmp = dfInput.as[Grade].map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()

val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}

dfInput.select(colsToRetain.toArray: _*).show()

+---+---+
| c4| c1|
+---+---+
|  1|  1|
|  1|  1|
|  1|  1|
+---+---+

And the case class:

case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  // Scala's == on boxed Integers is null-safe value equality
  def nullOrValue(c: Integer, n: Integer): Integer = if (c == n) c else null
}
Step by step:

  1. grade.dropWhenNotEqualsTo(1) returns a new Grade in which every value that does not satisfy the condition is replaced by null:

+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  1|null|   1|  1|
|  1|   1|null|  1|
|  1|null|null|  1|
+---+----+----+---+

  2. for (column <- tmp.columns) iterates over the columns.

  3. tmp.select(column).na.drop() drops the rows containing nulls, e.g. for c2 this returns:

+---+
| c2|
+---+
|  1|
+---+

  4. if (rowsCount == withoutNullsCount) colsToRetain += col(column) retains a column only when no rows were dropped, i.e. the column contained no nulls after step 1.
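For completeness, a sketch of how dfInput can be built as a typed Dataset[Grade] (assuming the case class above), so that the map over Grade values compiles:

import spark.implicits._

val dfInput = Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, null, 1),
  Grade(1, 10, 2, 1)
).toDS()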