1 vote

I am a newbie to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I have seen some examples in Scala but none in Java; any pointers to Java examples would be helpful. I tried creating a JavaRDD and then converting it to a Dataset, but I believe there must be a more straightforward way.

1
I actually ran into this problem too and couldn't figure it out. I don't know how you are creating your RDDs, but I was receiving them from Kafka without knowing the schema. To create a Dataset I had to change the format of the sent data: a JSON string instead of Avro-serialized data. After that I simply used session.read().json(javaRDD);. If you still want to use Avro, then I think the way is to put the data in an Avro file and use session.read().format("avro").load("avrofile.avro"); (not sure about the format string value, though). I still hope there is some simple way, so I will add this question to my favorites. – RadioLog
But maybe you will find a suitable Java example here: spark.apache.org/docs/latest/sql-programming-guide.html. Just choose the Java tab. – RadioLog
I was able to read the Avro data using Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("users.avro");, where users.avro is the data file and User.avsc is the schema I used. But I am not able to convert Dataset<Row> to Dataset<User>. I tried Encoder<User> UserEncoder = Encoders.bean(User.class); (User.class is the Avro-generated class) and then Dataset<User> df = spark.read().format("com.databricks.spark.avro").load("users.avro").as(UserEncoder);. – Pradeep

1 Answer

1 vote

First of all, you need to set hadoop.home.dir (on Windows this points at a winutils installation):

System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");

Then create a SparkSession (the Cassandra host and warehouse directory here are specific to my setup):

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ASH")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
        .getOrCreate();

In my code I am using an embedded Spark environment.

// Creates a DataFrame from the specified Avro file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
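To get the typed Dataset<User> the question asks about, one workaround is to map each Row into the bean explicitly instead of calling .as() on the Avro-generated class (Encoders.bean tends not to work on Avro-generated classes, since they carry extra properties such as getSchema()). Below is a minimal sketch assuming a plain User POJO with name and age fields; the field names and the file path are illustrative assumptions, not from the original data:

```java
import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroToTypedDataset {

    // Plain mirror of the Avro record; Encoders.bean needs a public
    // class with a no-arg constructor and getters/setters.
    public static class User implements Serializable {
        private String name;
        private Integer age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Integer getAge() { return age; }
        public void setAge(Integer age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("AvroToTypedDataset")
                .getOrCreate();

        // Untyped read, as in the answer above
        Dataset<Row> rows = spark.read()
                .format("com.databricks.spark.avro")
                .load("users.avro"); // assumed path

        // Map each Row into the POJO; the bean encoder describes
        // the target type for Spark.
        Dataset<User> users = rows.map(
                (MapFunction<Row, User>) row -> {
                    User u = new User();
                    u.setName(row.getAs("name")); // assumed column name
                    u.setAge(row.getAs("age"));   // assumed column name
                    return u;
                },
                Encoders.bean(User.class));

        users.show();
        spark.stop();
    }
}
```

If the column names in your .avsc differ, adjust the row.getAs(...) calls accordingly; the advantage of the explicit map over .as(encoder) is that it does not require the bean's shape to match the Row exactly.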

Hope this helps.