
How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and fetch data from the files stored in those buckets.

I can specify an access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud Storage?


2 Answers

2 votes

I found the answer here useful: Apache Drill using Google Cloud Storage

On Google Cloud Dataproc you can set it up with an initialization action, as in the answer above. There's also a complete initialization action you can use that creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.

If you're not using Cloud Dataproc, you can do the following on your already-installed Drill cluster.

Download the GCS connector JAR and put it in Drill's jars/3rdparty directory. GCS configuration is detailed at the link above. On Dataproc the connector jar is in /usr/lib/hadoop/lib, so the above initialization action does this:

# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty

You also need to configure core-site.xml and make it available to Drill, so that Drill knows how to connect to GCS.

# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf

Start or restart your drillbits as needed.
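On a typical installation that might look like this (the exact path depends on your deployment):

```
# Run on each node so the drillbit picks up the new jar and core-site.xml
$DRILL_HOME/bin/drillbit.sh restart
```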

Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:

export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
    "config": {
        "connection": "$DATAPROC_BUCKET",
        "enabled": true,
        "formats": {
            "avro": {
                "type": "avro"
            },
            "csv": {
                "delimiter": ",",
                "extensions": [
                    "csv"
                ],
                "type": "text"
            },
            "csvh": {
                "delimiter": ",",
                "extensions": [
                    "csvh"
                ],
                "extractHeader": true,
                "type": "text"
            },
            "json": {
                "extensions": [
                    "json"
                ],
                "type": "json"
            },
            "parquet": {
                "type": "parquet"
            },
            "psv": {
                "delimiter": "|",
                "extensions": [
                    "tbl"
                ],
                "type": "text"
            },
            "sequencefile": {
                "extensions": [
                    "seq"
                ],
                "type": "sequencefile"
            },
            "tsv": {
                "delimiter": "\t",
                "extensions": [
                    "tsv"
                ],
                "type": "text"
            }
        },
        "type": "file",
        "workspaces": {
            "root": {
                "defaultInputFormat": null,
                "location": "/",
                "writable": false
            },
            "tmp": {
                "defaultInputFormat": null,
                "location": "/tmp",
                "writable": true
            }
        }
    },
    "name": "gs"
}
EOF

Then POST the new plugin to any drillbit (I'm assuming you're running this on one of the drillbits):

curl -d@/tmp/gcs_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs.json

I believe you need to repeat this procedure, changing the name ("gs" above), if you want Drill to query multiple buckets.
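As a sketch of that repetition (the plugin and bucket names below are made up for illustration), you could clone the plugin configuration under a new name and connection before POSTing it to /storage/&lt;name&gt;.json:

```python
import copy
import json

def clone_plugin(plugin, new_name, new_connection):
    """Return a copy of a Drill storage plugin config with a new
    name and connection, so a second bucket can be registered."""
    clone = copy.deepcopy(plugin)
    clone["name"] = new_name
    clone["config"]["connection"] = new_connection
    return clone

# Trimmed-down version of the config written out above
base = {
    "name": "gs",
    "config": {
        "type": "file",
        "enabled": True,
        "connection": "gs://first-bucket",
    },
}

second = clone_plugin(base, "gs2", "gs://second-bucket")
print(json.dumps(second))
# POST the result to http://localhost:8047/storage/gs2.json as before
```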

Then you can launch sqlline and check that you can query files in that bucket.
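For example (the ZooKeeper address, bucket path, and file name are placeholders):

```
$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=localhost:2181
# then, at the sqlline prompt:
#   SELECT * FROM gs.root.`path/to/some-file.csv` LIMIT 10;
```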

1 vote

I know this question is quite old, but still, here's the way to do this without using Dataproc.

Add the JAR file from the GCP connectors to the jars/3rdparty directory. Then add the following to the core-site.xml file in the conf directory (replace the upper-case placeholders such as YOUR_PROJECT_ID with your own details):

<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required when
    an authentication key specified in the configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory that relative gs: URIs resolve against, inside the
    default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operations using Rewrite requests, which
    allows copying files between different locations and storage classes.
  </description>
</property>

Start Apache Drill.
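For example, depending on your setup (paths illustrative):

```
# Single-node / embedded mode:
$DRILL_HOME/bin/drill-embedded

# Or, on a distributed cluster, on each node:
$DRILL_HOME/bin/drillbit.sh start
```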

Add a custom storage to Drill.

You're good to go.

The solution is from here, where I go into more detail about what we do around data exploration with Apache Drill.