I'm trying to use Google Cloud Dataproc to run a Spark ML job on CSV files stored in GCS, but I'm having trouble figuring out how to compile the fat JAR for submission.
I can tell from the docs that Cloud Dataproc nodes have the GCS connector pre-installed, but I can't figure out how to add the connector to my SBT config so I can develop and compile a fat JAR locally to submit to Dataproc. Is there a line I can add to my build.sbt so the connector is available locally (i.e. so the project compiles), marked as "provided" if necessary so that it doesn't conflict with the version pre-installed on the worker nodes?
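Here's roughly what I have in mind for build.sbt. I'm guessing at the Maven coordinates and version strings, so treat this as a sketch of what I'm asking about rather than something I know works:

```scala
// build.sbt -- my best guess; coordinates/versions may be wrong
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is already on the Dataproc image, so "provided" keeps it
  // out of the fat JAR
  "org.apache.spark" %% "spark-core"  % "3.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.3.0" % "provided",

  // Is this the right coordinate for the GCS connector? And should
  // it also be "provided", since Dataproc pre-installs it on workers?
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.11" % "provided"
)
```

If the connector coordinate or the "provided" scoping above is wrong, that's exactly the part I'm unsure about.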
Any pointers or examples would be super appreciated.
TIA!