
I created a Spark cluster through the gcloud console with the following options:

gcloud dataproc clusters create cluster-name --region us-east1 --num-masters 1 --num-workers 2 --master-machine-type n1-standard-2 --worker-machine-type n1-standard-1 --metadata spark-packages=graphframes:graphframes:0.2.0-spark2.1-s_2.11

On the Spark master node, I launched the pyspark shell as follows:

pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11


found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in spark-packages

[SUCCESSFUL ] graphframes#graphframes;0.2.0-spark2.0-s_2.11!graphframes.jar (578ms)


    graphframes#graphframes;0.2.0-spark2.0-s_2.11 from spark-packages in [default]
    org.scala-lang#scala-reflect;2.11.0 from central in [default]
    org.slf4j#slf4j-api;1.7.7 from central in [default]
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    |      default     |   5   |   5   |   5   |   0   ||   5   |   5   |


Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.

>>> from graphframes import *

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes

The package resolves and downloads successfully, but the Python import fails. How do I load graphframes on a gcloud Dataproc Spark cluster?

--packages specifies Java/Scala packages, right? Is there a Python package you need to download as well? If you have to pip install graphframes, please make sure it doesn't depend on the pyspark or py4j packages. Installing either of those through pip will break pyspark on your cluster :( Instead, just install graphframes without those dependencies. - Karthik Palaniappan
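One way to check the point raised in this comment is to look inside the downloaded jar and see whether it ships any Python sources at all. The following is only a minimal sketch, assuming the jar bundles a graphframes/__init__.py (a jar is a zip archive, and Python can import from zip archives placed on sys.path). The helper names jar_has_python_package and import_from_jar are hypothetical, and the jar path is whatever location --packages saved the file to on your master node.

```python
import importlib
import sys
import zipfile


def jar_has_python_package(jar_path, package):
    """Return True if the archive at jar_path contains package/__init__.py."""
    with zipfile.ZipFile(jar_path) as jar:
        return package + "/__init__.py" in jar.namelist()


def import_from_jar(jar_path, package):
    """Put the archive on sys.path (if it is not there yet) and import package.

    This relies on Python's built-in zip import support. It only affects
    the current interpreter, i.e. the driver process.
    """
    if jar_path not in sys.path:
        sys.path.insert(0, jar_path)
    return importlib.import_module(package)
```

In the pyspark shell this could be used as import_from_jar(path_to_downloaded_jar, "graphframes"), after jar_has_python_package confirms the Python sources are really in the archive. Since this only changes the driver's path, the executors would still need the archive, for example through sc.addPyFile, so treat it as a diagnostic rather than a full fix.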

1 Answer
