I want to write a DataFrame in Avro format using a provided Avro schema rather than the schema Spark auto-generates. How can I tell Spark to use my custom Avro schema on write? The schema is:
{
  "type" : "record",
  "name" : "name1",
  "namespace" : "com.data",
  "fields" : [
    {
      "name" : "id",
      "type" : "string"
    },
    {
      "name" : "count",
      "type" : "int"
    },
    {
      "name" : "val_type",
      "type" : {
        "type" : "enum",
        "name" : "ValType",
        "symbols" : [ "s1", "s2" ]
      }
    }
  ]
}
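For context, inAvroSchema in the snippets below is an org.apache.avro.Schema parsed from this JSON, roughly like this (the file path is illustrative; in my job the JSON comes from an .avsc file):

import java.io.File;
import org.apache.avro.Schema;

// Parse the .avsc JSON above into an org.apache.avro.Schema instance.
// Schema.Parser#parse(File) throws IOException; handling omitted here.
Schema inAvroSchema = new Schema.Parser().parse(new File("/path/to/name1.avsc"));
// String.valueOf(inAvroSchema) yields the schema JSON passed to the "avroSchema" option below.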
Reading the Avro data with the avroSchema option works fine at this step:
Dataset<Row> d1 = spark.read()
    .option("avroSchema", String.valueOf(inAvroSchema))
    .format("com.databricks.spark.avro")
    .load("s3_path");
Next I run some spark.sql transformations on this data and store the result in a DataFrame (FinalDF).
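Roughly, that step looks like this (the actual SQL is more involved; the query here is just an illustration):

d1.createOrReplaceTempView("t1");
// Store the transformed result in FinalDF for the write step below.
Dataset<Row> FinalDF = spark.sql("SELECT id, count, val_type FROM t1");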
When I then try to write the Avro data to S3 based on the same Avro schema, the write fails.
The schema of FinalDF:
root
|-- id: string (nullable = true)
|-- count: integer (nullable = true)
|-- val_type: string (nullable = true)
FinalDF.write()
    .option("avroSchema", String.valueOf(inAvroSchema))
    .format("com.databricks.spark.avro")
    .mode("overwrite")
    .save("target_s3_path");
I got the error:
User class threw exception: org.apache.spark.SparkException: Job aborted.
......
Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string"
    at org.apache.avro.Schema.getTypes(Schema.java:299)
    at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
Is there any way to make Spark use the Avro schema when writing the Avro data? And if option("avroSchema", String.valueOf(inAvroSchema)) is the right approach, am I perhaps doing something wrong? The "forceSchema" option does not work in my case either.
Thanks in advance.
"type" : "enum",
, can you change it to string and check if it works? – Sai Kiran KrishnaMurthy