
I am pretty new to Scala. I have a dataframe with multiple columns, some of which contain null values at random places. I need to find every column that has even a single null value and drop it from the dataframe.

#### Input

| Column 1      | Column 2      | Column 3      | Column 4      | Column 5      |
| --------------| --------------| --------------| --------------| --------------|
|(123)-456-7890 | 123-456-7890  |(123)-456-789  | null          |(123)-456-7890 |
|(123)-456-7890 | 123-4567890   |(123)-456-7890 |(123)-456-7890 | null          |
|(123)-456-7890 | 1234567890    |(123)-456-7890 |(123)-456-7890 | null          |

#### Output

| Column 1      | Column 2      |
| --------------| --------------| 
|(123)-456-7890 | 123-456-7890  |
|(123)-456-7890 | 123-4567890   |
|(123)-456-7890 | 1234567890    |

Please advise. Thank you.

Hi A8H1, check the link below, I think it is the answer you are looking for: stackoverflow.com/questions/51322445/… - Rajat Tanwar

1 Answer


I would recommend a 2-step approach:

  1. Exclude columns that are not nullable from the null check, since they cannot contain nulls
  2. Assemble a list of the remaining columns that contain at least one null and drop them all at once

Creating a sample dataframe with a mix of nullable/non-nullable columns:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import spark.implicits._  // for toDF; already in scope in spark-shell

val df0 = Seq(
  (Some(1), Some("x"), Some("a"), None),
  (Some(2), Some("y"), None,      Some(20.0)),
  (Some(3), Some("z"), None,      Some(30.0))
).toDF("c1", "c2", "c3", "c4")

val newSchema = StructType(df0.schema.map { field =>
  if (field.name == "c1") field.copy(name = "c1_notnull", nullable = false) else field
})

// Revised dataframe with non-nullable `c1`
val df = spark.createDataFrame(df0.rdd, newSchema)
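As a quick sanity check that the patched schema actually takes effect, the same trick can be tried on a minimal two-column dataframe (sketched here assuming a local `SparkSession`; `before`/`patched`/`after` are names I made up for the example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Option[Int] makes c1 nullable in the inferred schema
val before = Seq((Some(1), "x"), (Some(2), "y")).toDF("c1", "c2")

// Copy the schema, flipping c1 to non-nullable
val patched = StructType(before.schema.map { f =>
  if (f.name == "c1") f.copy(nullable = false) else f
})
val after = spark.createDataFrame(before.rdd, patched)

after.schema("c1").nullable  // flips from true to false
```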

Carrying out steps 1 & 2:

val nullableCols = df.schema.collect{ case StructField(name, _, true, _) => name }
// nullableCols: Seq[String] = List(c2, c3, c4)

val colsWithNulls = nullableCols.filter(c => df.where(col(c).isNull).count > 0)
// colsWithNulls: Seq[String] = List(c3, c4)

df.drop(colsWithNulls: _*).show
// +----------+---+
// |c1_notnull| c2|
// +----------+---+
// |         1|  x|
// |         2|  y|
// |         3|  z|
// +----------+---+
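One caveat: the `filter` above fires a separate Spark job (a full `count`) per nullable column. For wide dataframes, the null counts can instead be gathered in a single aggregation pass. A sketch, assuming a local `SparkSession` and sample data of the same shape as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "x", Option.empty[String], Option.empty[Double]),
  (2, "y", Option.empty[String], Some(20.0)),
  (3, "z", Option.empty[String], Some(30.0))
).toDF("c1", "c2", "c3", "c4")

// One aggregation job: for each column, count the rows where it is null.
// count(when(cond, 1)) counts only the rows where cond holds, because
// count skips the nulls produced when the condition is false.
val nullCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*
).head

val colsWithNulls = df.columns.filter(c => nullCounts.getAs[Long](c) > 0)

df.drop(colsWithNulls: _*).show  // leaves only c1 and c2
```

This trades N small jobs for one wider one, which is usually cheaper since the data is scanned only once.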