Google offers "S3-compatible" access to its Cloud Storage service in the form of something called "Interoperability Mode".

We're running Spark on a closed network, and our only connection to the internet is through a proxy. Google's own Hadoop connector for Cloud Storage doesn't expose any proxy configuration settings, so we have to use Spark's built-in s3a connector, which lets you set everything you'd need (credentials, endpoint, and proxy) in core-site.xml:

<!-- example xml: the relevant properties in core-site.xml -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>....</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>....</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://storage.googleapis.com</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.proxy.host</name>
    <value>proxyhost</value>
  </property>
  <property>
    <name>fs.s3a.proxy.port</name>
    <value>12345</value>
  </property>
</configuration>
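
For what it's worth, the same properties can also be set at submit time instead of in core-site.xml, via SparkConf with the spark.hadoop. prefix. This is just a sketch of that alternative; the credentials, proxy host/port, and bucket path are placeholders, not values from a tested setup:

    # minimal PySpark sketch of setting the same s3a properties at runtime
    # via SparkConf -- credentials, proxy values, and the bucket path below
    # are placeholders
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("gcs-via-s3a")
        .set("spark.hadoop.fs.s3a.access.key", "....")
        .set("spark.hadoop.fs.s3a.secret.key", "....")
        .set("spark.hadoop.fs.s3a.endpoint", "https://storage.googleapis.com")
        .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
        .set("spark.hadoop.fs.s3a.proxy.host", "proxyhost")
        .set("spark.hadoop.fs.s3a.proxy.port", "12345")
    )
    sc = SparkContext(conf=conf)

    # hypothetical bucket/path, just to show the s3a:// URL form
    rdd = sc.textFile("s3a://my-bucket/some/path")
    print(rdd.count())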

However, unlike boto, which works fine through this proxy in our environment with similar settings, Spark throws a com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception as soon as it tries to use the proxy. It looks like this:

 com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception: 
   The provided security credentials are not valid.
   (Service: Amazon S3; Status Code: 403; Error Code: InvalidSecurity; 
   Request ID: null), S3 Extended Request ID: null
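
For comparison, the boto access that does work through the proxy is set up roughly like this (a sketch assuming boto 2's S3Connection; the proxy host/port and bucket name are placeholders):

    # rough sketch of a boto (boto 2) S3Connection that goes through the
    # proxy against GCS -- proxy values and bucket name are placeholders
    import boto
    from boto.s3.connection import OrdinaryCallingFormat

    conn = boto.connect_s3(
        aws_access_key_id="....",            # GCS interoperability access key
        aws_secret_access_key="....",        # GCS interoperability secret
        host="storage.googleapis.com",
        is_secure=True,
        proxy="proxyhost",
        proxy_port=12345,
        calling_format=OrdinaryCallingFormat(),  # path-style requests (assumption)
    )

    bucket = conn.get_bucket("my-bucket")    # hypothetical bucket name
    for key in bucket.list(prefix="some/prefix/"):
        print(key.name)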

What am I doing wrong here, or is this simply unsupported?

In the same vein, I'm curious whether this version of Spark is even using the jets3t library; I'm finding conflicting information.


1 Answer

I eventually figured this out: you have to remove some specific offending JARs from the classpath. I've detailed my solution in a gist for future me. :)

https://gist.github.com/chicagobuss/6557dbf1ad97e5a09709