
I want to read a Spark Avro file in Jupyter notebook.

I have spark-avro built.

When I go to my directory and do the following

pyspark --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1

This opens a Jupyter notebook in the browser, and I can then run the following commands and the file reads properly.

sdf_entities = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
sdf_entities.cache().take(1)

However, I don't want to pass the --packages option every time I open a pyspark notebook. For example, when I use the spark-csv package I just run

pyspark

in the terminal and it opens a Jupyter notebook with the spark-csv package available. I don't have to pass --packages for spark-csv there.

But this doesn't seem to work for spark-avro.

Note: 1) I have configured pyspark to use the IPython/Jupyter notebook, so whenever pyspark is run in the terminal it opens a Jupyter notebook automatically (see the sketch after the config listing below for the kind of setup I mean).

2) I have also added both the spark-csv and spark-avro packages to the spark-defaults.conf file in my spark/conf folder. Here is how the spark-defaults.conf file looks:

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              12g
spark.executor.memory            3g
spark.driver.maxResultSize       3g
spark.rdd.compress               false
spark.storage.memoryFraction     0.5


spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value

spark.jars.packages    com.databricks:spark-csv_2.11:1.4.0
spark-jars.packages    com.databricks:spark-avro_2.10:2.0.1
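
For reference, the "pyspark opens Jupyter" behaviour from note 1) comes from environment variables in my shell profile, roughly like this (a sketch; the exact lines in my profile may differ):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'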

Any help?


1 Answer


The correct property name is spark.jars.packages (not spark-jars.packages), and multiple packages should be provided as a single comma-separated list, the same as for the command-line argument.
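
On the command line the equivalent would look like this (versions chosen to match the config line below):

pyspark --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.5.0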

You should also use the Scala artifact that matches the Scala version used to build your Spark binaries. For example, with Scala 2.10 (the default in Spark 1.x):

spark.jars.packages  com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.5.0
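
If you are not sure which Scala version your Spark binaries were built with, spark-submit --version prints it ("Using Scala version ..."). With the corrected property in spark-defaults.conf, a plain pyspark launch should resolve both packages, so your original read should work without --packages; a minimal check, reusing the file from your question:

# started with just `pyspark`, no --packages flag
df = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
df.take(1)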