Created a Spark cluster through the gcloud CLI with the following options:
gcloud dataproc clusters create cluster-name --region us-east1 --num-masters 1 --num-workers 2 --master-machine-type n1-standard-2 --worker-machine-type n1-standard-1 --metadata spark-packages=graphframes:graphframes:0.2.0-spark2.1-s_2.11
On the Spark master node, launched the pyspark shell as follows:
pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
...
found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in spark-packages
[SUCCESSFUL ] graphframes#graphframes;0.2.0-spark2.0-s_2.11!graphframes.jar (578ms)
...
graphframes#graphframes;0.2.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
...
Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> from graphframes import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes
How do I load graphframes on a gcloud Dataproc Spark cluster?
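For context, a minimal sketch of why the import can fail even though the jar download succeeds (an assumption, not confirmed by the output above): --packages places the graphframes jar on the JVM classpath, while Python resolves `from graphframes import *` against `sys.path`, which does not include that jar.

```python
import sys

# --packages hands the graphframes jar to the JVM/driver classpath, but
# Python's import machinery only searches sys.path, so the Python module
# bundled inside the jar is never found there.
jar_visible = any("graphframes" in entry for entry in sys.path)
print(jar_visible)  # False unless the jar was added to sys.path manually
```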
--packages specifies Java/Scala packages, right? Is there a Python package you need to download as well? If you have to pip install graphframes, please ensure it doesn't depend on the pyspark or py4j packages. Installing either one of those through pip will break pyspark on your cluster :( Instead, just install graphframes without those dependencies. – Karthik Palaniappan