1
votes

Trying to read an Avro file.

import com.databricks.spark.avro._ // enables spark.read.avro(...)
val df = spark.read.avro(file)

Running into Avro schema cannot be converted to a Spark SQL StructType: [ "null", "string" ]

Tried to manually create a schema, but now running into the following:

import org.apache.spark.sql.types._
val s = StructType(List(StructField("value", StringType, nullable = true)))

val df = spark.read
  .option("inferSchema", "false")
  .schema(s)
  .avro(file)

com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException: Cannot convert Avro schema to catalyst type because schema at path is not compatible (avroType = StructType(StructField(value,StringType,true)), sqlType = STRING). Source Avro schema: ["null","string"]. Target Catalyst type: StructType(StructField(value,StringType,true))

Trying to override the Avro schema (without the null) also does not work:

val df = spark.read
  .option("inferSchema", "false")
  .option("avroSchema", """["string"]""")
  .avro(file)

Avro schema cannot be converted to a Spark SQL StructType: [ "string" ]

Looks like spark-avro only creates a GenericDatumReader[GenericRecord] and I need a GenericDatumReader[Utf8] :(
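
For reference, a minimal sketch of reading the same data with the plain Avro API and a Utf8-typed datum reader (the path below is just a placeholder):

import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.util.Utf8
import java.io.File

// Uses the writer schema embedded in the file; with a top-level
// ["null", "string"] schema each datum comes back as null or a Utf8.
val reader = new DataFileReader(new File("/path/to/file.avro"), new GenericDatumReader[Utf8]())
while (reader.hasNext) {
  val value = reader.next()
  println(Option(value).map(_.toString)) // None for nulls, Some(...) for strings
}
reader.close()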

1
Did you try this: val df = spark.read.option("inferSchema", "true").avro(file)? – Kaushal
Yes, but with the same result: Spark (correctly) determines the schema to be a union of ["null", "string"]. – timvw

1 Answer

1
votes

Please make sure you are providing the correct AVSC for the data. The ["null", "string"] union is there to allow null values in the Avro data. You can load the schema of your Avro file with:

val schema = new org.apache.avro.Schema.Parser().parse(new java.io.File("user.avsc"))
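
For illustration only, an AVSC matching the StructType from the question would wrap the nullable string in a record field; the record and field names below are placeholders, not taken from the original file:

// Hypothetical schema: a record with a single nullable string field,
// mirroring StructField("value", StringType, nullable = true).
val avsc =
  """{
    |  "type": "record",
    |  "name": "Value",
    |  "fields": [
    |    { "name": "value", "type": ["null", "string"], "default": null }
    |  ]
    |}""".stripMargin

val schemaFromString = new org.apache.avro.Schema.Parser().parse(avsc)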

Or, if you have a Java class generated from the Avro schema, you can get the schema from it:

val schema = User.getClassSchema() // User being the class generated from user.avsc

Now, once you have the schema, it is very simple to build a DataFrame with it:

val df = sparkSession.read
  .format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/home/garvit.vijay/000009_0.avro")

df.printSchema()
df.show()

Hope it works for you.