I am trying to apply this idea https://fullstackml.com/how-to-check-hypotheses-with-bootstrap-and-apache-spark-cd750775286a to a dataframe I have. The code I'm using is this part:
import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int, left: Double, right: Double)
    : (Double, Double) = {
  // Simulate by sampling and calculating the average of each subsample
  val hist = Array.fill(N)(0.0)
  for (i <- 0 until N) {
    hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean
  }
  // Sort the averages and calculate the quantiles
  quickSort(hist)
  val left_quantile = hist((N * left).toInt)
  val right_quantile = hist((N * right).toInt)
  (left_quantile, right_quantile)
}
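(For anyone checking the quantile indexing: it can be sanity-checked in plain Scala without a Spark cluster. This is just a sketch with made-up subsample means, not real data.)

```scala
import scala.util.Sorting.quickSort

// Same quantile logic as getConfInterval, on an in-memory array of
// hypothetical subsample means instead of a Spark RDD.
def quantiles(means: Array[Double], left: Double, right: Double): (Double, Double) = {
  val hist = means.clone()      // avoid mutating the caller's array
  quickSort(hist)
  val n = hist.length
  (hist((n * left).toInt), hist((n * right).toInt))
}

// Ten made-up bootstrap means, just for illustration.
val fakeMeans = Array(0.30, 0.28, 0.35, 0.31, 0.29, 0.33, 0.27, 0.32, 0.34, 0.26)
val (lo, hi) = quantiles(fakeMeans, 0.025, 0.975)
// With n = 10, (10 * 0.025).toInt = 0 and (10 * 0.975).toInt = 9,
// so this picks the smallest and largest of the sorted means.
```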
This runs OK, but when I try to apply it to:
val data = mydf.map( _.toDouble )
val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)
val H0_mean = 30
if (left_qt < H0_mean && H0_mean < right_qt) {
  println("We failed to reject H0. It seems like H0 is correct.")
} else {
  println("We rejected H0")
}
I get the error:
error: value toDouble is not a member of org.apache.spark.sql.Row val data = dfTeste.map( _.toDouble )
And when I do it without the .map( _.toDouble ), I get:
notebook:4: error: type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[Double]
mydf is basically a dataframe from which I selected a single column (of type double, with several rows of either 0.0 or 1.0).
When I do:
dfTeste.map(x=>x.toString()).rdd
it successfully turns into an org.apache.spark.rdd.RDD[String], but I can't find a way to do the same for Double. I'm very new to this, so I apologize if it doesn't make much sense.
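From the Spark docs it looks like Row has typed getters, so something like the following might be the conversion I need; this is an untested assumption on my part, for a dataframe whose only column is of Spark SQL type DoubleType:

// Assuming mydf has exactly one column, of type DoubleType:
// .rdd gives an RDD[Row]; getDouble(0) extracts the first column as a Double.
val data: org.apache.spark.rdd.RDD[Double] =
  mydf.rdd.map(row => row.getDouble(0))

This would give getConfInterval the RDD[Double] it expects instead of a Dataset[Row].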
Comments:
– antonioACR1: If mydf is a dataframe, then you should get an error when you define data as val data = mydf.map( _.toDouble )?
– Carolina: Yes, mydf.map( _.toDouble ) fails, and if I don't do the map I get the type-mismatch error.