I want to convert RDD to DataFrame using StructType. But item "Broken,Line," would cause error. Is there an elegant way to process record like this? Thanks.
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
val mySchema = StructType(Array(
StructField("colA", StringType, true),
StructField("colB", StringType, true),
StructField("colC", StringType, true)))
val x = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val inputx = sc.parallelize(x).
| map((x:String) => Row.fromSeq(x.split(",").slice(0,mySchema.size).toSeq))
val df = spark.createDataFrame(inputx, mySchema)
df.show
Error would be like this:
Name: org.apache.spark.SparkException Message: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 2
I'm using:
- Spark: 2.2.0
- Scala: 2.11.8
And I ran the code in spark-shell.
x.split("," , -1)if you're worried about that part. - philantrovert