1 vote

I am a newbie to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I have seen some examples in Scala but none in Java; any pointers to Java examples would be helpful. I tried creating a JavaRDD and then converting it to a Dataset, but I believe there must be a more straightforward way.

1
I actually ran into this problem too and couldn't figure it out. I don't know how you are creating your RDDs, but I was receiving them from Kafka without knowing the schema. To create a Dataset I had to change the format of the sent data: a JSON string instead of Avro-serialized data. After that I simply used session.read().json(javaRDD);. If you still want to use Avro, then I think the way is to put the data in an Avro file and use session.read().format("avro").load("avrofile.avro"); (not sure about the format string value, though). I still hope there is some simple way, so I will add this question to my favorites. – RadioLog
But maybe you will find a suitable Java example here: spark.apache.org/docs/latest/sql-programming-guide.html. Just choose the Java tab. – RadioLog
I was able to read the Avro data using Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("users.avro");, where users.avro is the data file and User.avsc is the schema I used. But I am not able to convert Dataset<Row> to Dataset<User>. I tried Encoder<User> UserEncoder = Encoders.bean(User.class); (User.class is the Avro-generated class) and then Dataset<User> df = spark.read().format("com.databricks.spark.avro").load("users.avro").as(UserEncoder);. – Pradeep

1 Answer

1 vote

First of all, you need to set hadoop.home.dir (on Windows this points at a winutils installation):

System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");

Then create a SparkSession (the Cassandra host and warehouse directory here are specific to my setup):

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ASH")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
        .getOrCreate();

In my code I am using an embedded Spark environment.

// Creates a DataFrame from the specified Avro file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
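To get the typed Dataset<User> the question asks about, one workaround is to map each Row into the bean explicitly instead of calling .as() on the Avro-generated class (Encoders.bean tends not to work on Avro-generated classes, since they carry extra properties such as getSchema()). Below is a minimal sketch assuming a plain User POJO with name and age fields; the field names and the file path are illustrative assumptions, not from the original data:

```java
import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroToTypedDataset {

    // Plain mirror of the Avro record; Encoders.bean needs a public
    // class with a no-arg constructor and getters/setters.
    public static class User implements Serializable {
        private String name;
        private Integer age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Integer getAge() { return age; }
        public void setAge(Integer age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("AvroToTypedDataset")
                .getOrCreate();

        // Untyped read, as in the answer above
        Dataset<Row> rows = spark.read()
                .format("com.databricks.spark.avro")
                .load("users.avro"); // assumed path

        // Map each Row into the POJO; the bean encoder describes
        // the target type for Spark.
        Dataset<User> users = rows.map(
                (MapFunction<Row, User>) row -> {
                    User u = new User();
                    u.setName(row.getAs("name")); // assumed column name
                    u.setAge(row.getAs("age"));   // assumed column name
                    return u;
                },
                Encoders.bean(User.class));

        users.show();
        spark.stop();
    }
}
```

If the column names in your .avsc differ, adjust the row.getAs(...) calls accordingly; the advantage of the explicit map over .as(encoder) is that it does not require the bean's shape to match the Row exactly.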

Hope this helps.