15
votes

I'm trying to switch from reading CSV flat files to Avro files on Spark. Following https://github.com/databricks/spark-avro, I use:

import com.databricks.spark.avro._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.avro("gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro")

and get

java.lang.UnsupportedOperationException: This mix of union types is not supported (see README): ArrayBuffer(STRING)

The README states clearly:

This library supports reading all Avro types, with the exception of complex union types. It uses the following mapping from Avro types to Spark SQL types:

When I try to read the same file as text, I can see the schema:

val df = sc.textFile("gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro")
df.take(2).foreach(println)

{"name":"log_record","type":"record","fields":[{"name":"request","type":{"type":"record","name":"request_data","fields":[{"name":"datetime","type":"string"},{"name":"ip","type":"string"},{"name":"host","type":"string"},{"name":"uri","type":"string"},{"name":"request_uri","type":"string"},{"name":"referer","type":"string"},{"name":"useragent","type":"string"}]}}

<------- an excerpt of the full output ------->

Since I have little control over the format in which I receive these files, my question is: is there a workaround someone has tested and can recommend?

I use Google Cloud Dataproc with:

MASTER=yarn-cluster spark-shell --num-executors 4 --executor-memory 4G --executor-cores 4 --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.11:1.3.0

Any help would be greatly appreciated.

1
You can use newAPIHadoopFile for reading Avro files - you'll need to use Spark's core API rather than the SQL API. Any special reason not to use it? - Igor Berman

1 Answer

3
votes

You won't find a solution that works with Spark SQL. Every column in a Spark DataFrame has to contain values that can be represented as a single DataType, so complex union types simply cannot be represented in a DataFrame.

If you want to read data like this, you should use the RDD API and convert the loaded data to a DataFrame later.
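A minimal sketch of that approach, using AvroKeyInputFormat through newAPIHadoopFile as the comment above suggests. The extracted field names come from the schema excerpt in the question; treat this as an untested outline to adapt to your actual schema, not a drop-in fix:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Read the Avro file as (AvroKey[GenericRecord], NullWritable) pairs
// via the new Hadoop API, bypassing spark-avro's union-type check.
val rdd = sc.newAPIHadoopFile(
  "gs://logs.xyz.com/raw/2016/04/20/div1/div2/2016-04-20-08-28-35.UTC.blah-blah.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// The input format reuses record objects, so extract plain Scala values
// inside the map before doing anything else with the RDD. Field names
// here ("request", "datetime", "ip", "host", "uri") are taken from the
// schema excerpt above.
val df = rdd.map { case (key, _) =>
  val request = key.datum.get("request").asInstanceOf[GenericRecord]
  (request.get("datetime").toString,
   request.get("ip").toString,
   request.get("host").toString,
   request.get("uri").toString)
}.toDF("datetime", "ip", "host", "uri")

df.show()

Flattening the union-typed fields by hand in the map step is the key point: once every column holds a single concrete type, toDF has no trouble building the schema.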