I am trying to apply this idea https://fullstackml.com/how-to-check-hypotheses-with-bootstrap-and-apache-spark-cd750775286a to a dataframe I have. The code I'm using is this part:
import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int, left: Double, right: Double)
    : (Double, Double) = {
  // Simulate by sampling and calculating the average of each subsample
  val hist = Array.fill(N)(0.0)
  for (i <- 0 until N) {
    hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean
  }
  // Sort the averages and calculate the quantiles
  quickSort(hist)
  val left_quantile = hist((N * left).toInt)
  val right_quantile = hist((N * right).toInt)
  (left_quantile, right_quantile)
}
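(For anyone checking the quantile indexing: it can be sanity-checked in plain Scala without a Spark cluster. This is just a sketch with made-up subsample means, not real data.)

```scala
import scala.util.Sorting.quickSort

// Same quantile logic as getConfInterval, on an in-memory array of
// hypothetical subsample means instead of a Spark RDD.
def quantiles(means: Array[Double], left: Double, right: Double): (Double, Double) = {
  val hist = means.clone()      // avoid mutating the caller's array
  quickSort(hist)
  val n = hist.length
  (hist((n * left).toInt), hist((n * right).toInt))
}

// Ten made-up bootstrap means, just for illustration.
val fakeMeans = Array(0.30, 0.28, 0.35, 0.31, 0.29, 0.33, 0.27, 0.32, 0.34, 0.26)
val (lo, hi) = quantiles(fakeMeans, 0.025, 0.975)
// With n = 10, (10 * 0.025).toInt = 0 and (10 * 0.975).toInt = 9,
// so this picks the smallest and largest of the sorted means.
```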
This runs OK, but when I try to apply it to:
val data = mydf.map( _.toDouble )
val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)
val H0_mean = 30
if (left_qt < H0_mean && H0_mean < right_qt) {
  println("We failed to reject H0. It seems like H0 is correct.")
} else {
  println("We rejected H0")
}
I get the error:
error: value toDouble is not a member of org.apache.spark.sql.Row val data = dfTeste.map( _.toDouble )
And when I do it without the .map( _.toDouble ), I get:
notebook:4: error: type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[Double]
mydf is basically a dataframe from which I selected a single column (of type double, with several rows of either 0.0 or 1.0).
When I do:
dfTeste.map(x=>x.toString()).rdd
it successfully turns into an org.apache.spark.rdd.RDD[String], but I can't find a way to do the same for Double. I'm very new to this, so I apologize if it doesn't make much sense.
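From the Spark docs it looks like Row has typed getters, so something like the following might be the conversion I need; this is an untested assumption on my part, for a dataframe whose only column is of Spark SQL type DoubleType:

// Assuming mydf has exactly one column, of type DoubleType:
// .rdd gives an RDD[Row]; getDouble(0) extracts the first column as a Double.
val data: org.apache.spark.rdd.RDD[Double] =
  mydf.rdd.map(row => row.getDouble(0))

This would give getConfInterval the RDD[Double] it expects instead of a Dataset[Row].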
Comments:
– antonioACR1: If mydf is a dataframe, then you should get an error when you define data as val data = mydf.map( _.toDouble )?
– Carolina: Yes, mydf.map( _.toDouble ) fails, and if I don't do the map I get the type-mismatch error.