Spark Scala Split DataFrame by some value range

0

votes

Suppose I have a DataFrame with a column named x whose values lie in [0, 1]. I'd like to split it by the value of x into the ranges [0, 0.1), [0.1, 0.2), ..., [0.9, 1]. Is there a good, fast way to do that? I'm using Spark 2 with Scala.

Update: Ideally, the result should be 10 new DataFrames, one for each range.

scala apache-spark spark-dataframe apache-spark-mllib

0

votes

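If you mean to discretize a double-typed column, you can multiply the column by 10 and cast the result to an integer type, which cuts the column into 10 discrete bins. Note that a value of exactly 1.0 maps to bin 10, so it needs separate handling if the last range must be [0.9, 1]: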

import org.apache.spark.sql.types.IntegerType
import spark.implicits._  // already in scope in spark-shell; provides toDF and $"..."

val df = Seq(0.32, 0.5, 0.99, 0.72, 0.11, 0.03).toDF("A")
// df: org.apache.spark.sql.DataFrame = [A: double]

df.withColumn("new", ($"A" * 10).cast(IntegerType)).show
+----+---+
|   A|new|
+----+---+
|0.32|  3|
| 0.5|  5|
|0.99|  9|
|0.72|  7|
|0.11|  1|
|0.03|  0|
+----+---+
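
The multiply-and-cast trick puts exactly 1.0 into an eleventh bin (10), whereas the question asks for a closed last range [0.9, 1]. Here is a minimal sketch of one way to fold that value into bin 9, assuming a local SparkSession; the column name `bin`, the app name, and the sample values are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor, least, lit}

// In spark-shell, getOrCreate simply returns the session that already exists.
val spark = SparkSession.builder.master("local[*]").appName("bin-clamp").getOrCreate()
import spark.implicits._

// Sample values chosen to include both ends of [0, 1].
val edge = Seq(0.0, 0.05, 0.1, 0.95, 1.0).toDF("A")

// floor(A * 10) yields a long in 0..10; least(..., 9) caps it so 1.0 joins bin 9.
val binned = edge.withColumn("bin", least(floor(col("A") * 10), lit(9L)).cast("int"))

binned.show()
```

With this expression, both 0.95 and 1.0 end up in bin 9, so the ten bins cover [0, 1] without an extra group.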

0

votes

Expanding on @Psidom's solution for creating ranges, here's one approach to create a dataframe for each range:

import org.apache.spark.sql.types.IntegerType
import spark.implicits._  // already in scope in spark-shell; provides toDF and $"..."
val df = Seq(0.2, 0.71, 0.95, 0.33, 0.28, 0.8, 0.73).toDF("x")
val df2 = df.withColumn("g", ($"x" * 10.0).cast(IntegerType))

df2.show
+----+---+
|   x|  g|
+----+---+
| 0.2|  2|
|0.71|  7|
|0.95|  9|
|0.33|  3|
|0.28|  2|
| 0.8|  8|
|0.73|  7|
+----+---+

// Collect the distinct range ids to the driver, then build one filtered DataFrame per id.
val dfMap = df2.select($"g").distinct.
  collect.
  map(_.getInt(0)).
  map( g => g -> df2.where($"g" === g) ).
  toMap

dfMap.get(3).foreach(_.show)  // prints nothing if no row falls in range 3
+----+---+
|   x|  g|
+----+---+
|0.33|  3|
+----+---+

dfMap.get(7).foreach(_.show)  // prints nothing if no row falls in range 7
+----+---+
|   x|  g|
+----+---+
|0.71|  7|
|0.73|  7|
+----+---+
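
The map above only contains ranges that actually occur in the data, and it needs a `collect` to discover them. If you always want exactly 10 DataFrames, as the question's update asks, one alternative is to filter on the range bounds directly. This is a sketch under assumptions: the names `data` and `parts` and the extra sample value 1.0 are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// In spark-shell, getOrCreate simply returns the session that already exists.
val spark = SparkSession.builder.master("local[*]").appName("split-by-range").getOrCreate()
import spark.implicits._

val data = Seq(0.2, 0.71, 0.95, 0.33, 0.28, 0.8, 0.73, 1.0).toDF("x")

// One lazily evaluated DataFrame per range; no driver-side collect is needed.
val parts: Map[Int, DataFrame] = (0 until 10).map { i =>
  val lo = i / 10.0
  val hi = (i + 1) / 10.0
  val inRange =
    if (i == 9) col("x") >= lo && col("x") <= hi  // last range is closed: [0.9, 1]
    else col("x") >= lo && col("x") < hi          // all others are half-open: [lo, hi)
  i -> data.where(inRange)
}.toMap

parts(7).show()
```

Every key from 0 to 9 is present, so ranges with no matching rows simply yield empty DataFrames. Each filter rescans `data` when evaluated, so caching `data` first may be worthwhile if all ten parts are used.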

[UPDATE]

If your ranges are irregular, you can define a function that maps a Double to the corresponding Int range id and then wrap it in a UDF, as follows:

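import org.apache.spark.sql.functions.udf

// Map a value to its range id; anything outside [0, 1) falls through to 99.
val g: Double => Int = {
  case x if (x >= 0.0 && x < 0.12345) => 1
  case x if (x >= 0.12345 && x < 0.4834) => 2
  case x if (x >= 0.4834 && x < 1.0) => 3
  case _ => 99  // catch-all
}

val groupUDF = udf(g)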

val df = Seq(0.1, 0.2, 0.71, 0.95, 0.03, 0.09, 0.44, 5.0).toDF("x")
val df2 = df.withColumn("g", groupUDF($"x"))

df2.show
+----+---+
|   x|  g|
+----+---+
| 0.1|  1|
| 0.2|  2|
|0.71|  3|
|0.95|  3|
|0.03|  1|
|0.09|  1|
|0.44|  2|
| 5.0| 99|
+----+---+
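
As an alternative to a hand-written UDF, irregular ranges can also be expressed with MLlib's `Bucketizer`, which assigns each value the index of the split interval it falls into. The sketch below assumes the spark-mllib module is on the classpath (it is in spark-shell) and reuses the split points from the UDF above; the infinite outer splits are an added assumption so that out-of-range values such as 5.0 get a bucket instead of raising an error.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate simply returns the session that already exists.
val spark = SparkSession.builder.master("local[*]").appName("bucketizer-ranges").getOrCreate()
import spark.implicits._

val df = Seq(0.1, 0.2, 0.71, 0.95, 0.03, 0.09, 0.44, 5.0).toDF("x")

// Splits must be strictly increasing. Bucket i covers [splits(i), splits(i + 1)),
// except the last bucket, which also includes its upper bound.
val splits = Array(Double.NegativeInfinity, 0.0, 0.12345, 0.4834, 1.0, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("x")
  .setOutputCol("g")
  .setSplits(splits)

// The output column is a Double bucket index: 0.0 for x < 0, 4.0 for x >= 1.0.
val bucketed = bucketizer.transform(df)

bucketed.show()
```

With these splits, buckets 1, 2 and 3 line up with the UDF's range ids, while negative values land in bucket 0 and values of 1.0 or more land in bucket 4 rather than in a catch-all 99.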
