How to select specific columns from Spark DataFrame based on the value of another column?

Question

Consider a DataFrame df with 4 columns c0, c1, c2 and c3 where c0 and c1 are nested columns(struct type) and the other two are string type:

root
 |-- c0: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- c1: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

I want to select all the values of c0 or c1 based on the value of c3.

Example: If the value of c3 is "d", I want to select c0.* else c1.*

This is what I've tried so far but without luck:

Approach: Using when and otherwise inside the select clause.

.select(
  col("c3"),
  col("c4"),
  when(col("c3") === "d", col("c0.*").otherwise(col("c1.*"))))

This gives the following exception:

org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'casewhen';

Then instead of using col I've tried using df:

.select(
  col("c3"),
  col("c4"),
  when(col("c3") =!= "d", df("c0").otherwise(df("c1"))))

This gives the following exception:

otherwise() can only be applied on a Column previously generated by when()

Any help on this would be appreciated!

PS: I'm a beginner in Spark :)

blackbishop blackbishop · Accepted Answer · 2020-01-04T11:26:49

You could first get the struct you want using when and then use * to select the nested fields like this:

df.withColumn("c01", when($"c3" === "d", $"c0").otherwise($"c1"))
  .select($"c2", $"c3", $"c01.*")

For the other error:

otherwise() can only be applied on a Column previously generated by when()

You're simply missing a bracket as you call otherwise on df("c0") instead of the when column.

How to select specific columns from Spark DataFrame based on the value of another column?

1 Answers