I am reading the schema of a data frame from a text file. The file looks like this:

id,1,bigint
price,2,bigint
sqft,3,bigint
zip_id,4,int
name,5,string

and I am mapping the parsed type names to Spark SQL data types. The code for creating the data frame is:

import scala.collection.mutable.ListBuffer
import scala.io.Source
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schemaSt = new ListBuffer[(String, String)]()
// read schema from file
for (line <- Source.fromFile("meta.txt").getLines()) {
  val word = line.split(",")
  schemaSt += ((word(0),word(2)))
}

// map datatypes
val types = Map("int" -> IntegerType, "bigint" -> LongType)
      .withDefault(_ => StringType)
val schemaChanged = schemaSt.map(x => (x._1, types(x._2)))

// read data source
val lines = spark.sparkContext.textFile("data source path")

val fields = schemaChanged.map(x => StructField(x._1, x._2, nullable = true)).toList

val schema = StructType(fields)

val rowRDD = lines
  .map(_.split("\t"))
  .map(attributes => Row.fromSeq(attributes))

// Apply the schema to the RDD
val new_df = spark.createDataFrame(rowRDD, schema)
new_df.show(5)
new_df.printSchema()

But the above works only for StringType. For IntegerType and LongType, it throws exceptions:

java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int

and

java.lang.RuntimeException: java.lang.String is not a valid external type for schema of bigint.

Thanks in advance!

2 Answers

I had the same problem, and its cause is the Row.fromSeq() call.

If it is called on an array of String, the resulting Row is a row of Strings, which does not match the numeric column types in your schema (bigint and int).

To get a valid data frame out of Row.fromSeq(values: Seq[Any]), the elements of values have to be of the types that correspond to your schema.
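
A minimal sketch of that conversion, reusing the lines and schema values from the question (assumed to be in scope; malformed numbers will still throw at runtime):

val rowRDD = lines
  .map(_.split("\t"))
  .map { attributes =>
    // convert each token to the type its StructField expects
    val typed = attributes.zip(schema.fields).map {
      case (value, field) => field.dataType match {
        case IntegerType => value.trim.toInt
        case LongType    => value.trim.toLong
        case _           => value // StringType and anything else stays a String
      }
    }
    Row.fromSeq(typed)
  }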

You are trying to store strings in numerically typed columns.

You need to cast the string-encoded numerical data to the appropriate numerical types while parsing.
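
One way to do that without converting tokens by hand (a sketch, reusing the schema built in the question and assuming the data file is tab-separated) is to let Spark's CSV reader parse and cast the columns for you:

val new_df = spark.read
  .option("sep", "\t")   // the source file is tab-separated
  .schema(schema)        // the StructType built from meta.txt
  .csv("data source path")

new_df.printSchema()

With an explicit schema, the reader produces the int and bigint columns directly, so no manual Row construction is needed.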