
I am trying to map the values from a CSV file into an RDD, but I get the following error because some of the fields are null.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NumberFormatException: empty String

Here is the code I am using:

// Load and parse the data
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data.csv")

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

Is there any way to check for nulls? I thought of doing it with a try/catch, but it doesn't seem to work:

val parsedData = data.map { s =>
  try {
    Vectors.dense(s.split(',').map(_.toDouble))
  } catch {
    case e: NumberFormatException =>
      println("Nulls somewhere")
      Vectors.dense(Array.empty[Double]) // fallback so both branches return a Vector
  }
}
The spark-csv package can be used to read the CSV data; see stackoverflow.com/questions/29704333/…. If you want the underlying RDD, call rdd() on the DataFrame object. – shriyog
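
For reference, a minimal sketch of that approach; it assumes Spark 1.x with the com.databricks:spark-csv package on the classpath, and the data.csv file name is taken from the question:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the CSV into a DataFrame, then drop down to the underlying RDD[Row]
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load("data.csv")

val rowRdd = df.rdd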

1 Answer


You can filter out items that are empty; just add a filter step to your transformation:

val parsedData = data
  .map(s => Vectors.dense(s.split(',').filter(!_.isEmpty).map(_.toDouble)))
  .filter(_.size != 0)

This way, empty fields are dropped from each line, and a completely empty line produces an empty Vector, which the final filter removes.
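
For illustration, a small example with made-up input lines (the second line has an empty field, the third is entirely empty):

val sample = sc.parallelize(Seq("1.0,2.0,3.0", "4.0,,6.0", ""))

val vectors = sample
  .map(s => Vectors.dense(s.split(',').filter(!_.isEmpty).map(_.toDouble)))
  .filter(_.size != 0)

// "4.0,,6.0" yields [4.0,6.0] (the empty field is dropped), and ""
// yields an empty Vector that the final filter removes
vectors.collect().foreach(println)

Note that dropping empty fields shortens the vector for rows with missing values, so columns no longer line up across rows; if that matters for your use case, you may want to discard incomplete rows instead.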