I want to read an Avro file into Spark from a Jupyter notebook.
I have the spark-avro package built.
When I go to my directory and run the following:
pyspark --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
This opens a Jupyter notebook in the browser, and I can then run the following commands, which read the file properly:
sdf_entities = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
sdf_entities.cache().take(1)
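The loaded result behaves like any other DataFrame, so the usual DataFrame API applies; for example, a quick schema check (just a sanity check on my part, nothing Avro-specific):

# The Avro source returns an ordinary DataFrame, so standard calls work:
sdf_entities.printSchema()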
However, I don't want to pass the --packages option every time I open a pyspark notebook. For example, when I want to use the spark-csv package I just run
pyspark
in the terminal, and it opens a Jupyter notebook with the spark-csv package available; I don't have to pass the --packages option for spark-csv there.
But this doesn't seem to work for spark-avro.
Note: 1) I have configured the IPython/Jupyter notebook command as "pyspark" in my configuration settings, so whenever pyspark is called in the terminal it opens a Jupyter notebook automatically.
2) I have also added entries for both spark-csv and spark-avro to the spark-defaults.conf file in my spark/conf folder. Here is how it looks:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 12g
spark.executor.memory 3g
spark.driver.maxResultSize 3g
spark.rdd.compress false
spark.storage.memoryFraction 0.5
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
spark-jars.packages com.databricks:spark-avro_2.10:2.0.1
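One thing I'm unsure about while re-reading the file: the Spark docs describe spark.jars.packages as a single comma-separated list of Maven coordinates, and my last line spells the key with a hyphen (spark-jars.packages) rather than a dot. A combined line mirroring the --packages call that works for me would presumably look like this (same coordinates as above; this is a guess, not something I have verified):

spark.jars.packages com.databricks:spark-csv_2.11:1.4.0,org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1

(I also notice the two Databricks packages target different Scala versions, _2.11 vs _2.10; I assume they would need to match the Scala version of my Spark build.)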
Any help?