I'm running Spark in a Jupyter notebook (using the jupyter-scala kernel). I have a DataFrame with columns of type String, and I want a new DataFrame with those values cast to Int. I've tried all of the answers from this post:

How to change column types in Spark SQL's DataFrame?
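
For reference, these are the kinds of casts I've been trying (roughly what the answers there suggest; dF and the contbr_zip column come from the example further down):

import org.apache.spark.sql.types.IntegerType

// all of these variants hit the same failure for me
val viaType = dF.withColumn("contbr_zip", dF("contbr_zip").cast(IntegerType))
val viaName = dF.withColumn("contbr_zip", dF("contbr_zip").cast("int"))
val viaExpr = dF.selectExpr("cast(contbr_zip as int) as contbr_zip")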

But I keep getting an error:

org.apache.spark.SparkException: Job aborted due to stage failure

In particular, I am getting this error message:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 44: Decimal

So I went and looked at line 43:

/* 043 */ Decimal tmpDecimal6 = Decimal.apply(new java.math.BigDecimal(primitive5.toString()));

So far nothing that I've tried has worked.

Here is a simple example:

// read the CSV with spark-csv; without schema inference, every column comes in as a String
val dF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "../P80001571-ALL.csv", "header" -> "true"))
val dF2 = castColumnTo( dF, "contbr_zip", IntegerType )
dF2.show


where castColumnTo is defined as suggested by Martin Senne in the post mentioned above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, IntegerType}

object DFHelper {
  // replace column `cn` with the same values cast to type `tpe`
  def castColumnTo( df: DataFrame, cn: String, tpe: DataType ) : DataFrame = {
    df.withColumn( cn, df(cn).cast(tpe) )
  }
}

This is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 97, Column 45: Decimal

Line 97 looks like this:

/* 097 */ Decimal tmpDecimal18 = Decimal.apply(new java.math.BigDecimal(primitive17.toString()));

1 Answer


I seem to have solved the issue; it was related to the way I was setting up Spark to run in the notebook.

This is what I had before:

@transient val Spark = new ammonite.spark.Spark

import Spark.{ sparkConf, sc, sqlContext }
sc
import sqlContext.implicits._
import sqlContext._

This is what I have now:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
val conf = new SparkConf().setAppName("appname").setMaster("local")
val sc = new SparkContext(conf)
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

sqlContext
import sqlContext._
import sqlContext.implicits._

Things seem to be working now.
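
For a quick sanity check, re-running the cast from the question (same castColumnTo helper and CSV path as above) should now go through without the codegen error:

import org.apache.spark.sql.types.IntegerType

val dF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "../P80001571-ALL.csv", "header" -> "true"))
val dF2 = castColumnTo( dF, "contbr_zip", IntegerType )
dF2.printSchema()   // contbr_zip should now be reported as integer
dF2.show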