
I am trying to work with Apache Drill. I am new to this whole environment and just trying to understand how Apache Drill works.

I am trying to query my JSON data stored on S3 using Apache Drill. My bucket is created in US East (N. Virginia).
I have created a new Storage Plugin for S3 using this link.

Here is the configuration for my new S3 Storage Plugin:

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://testing-drill/",
  "config": {
    "fs.s3a.access.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "fs.s3a.secret.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}

I have also configured my core-site-example.xml as follows:

<configuration>

    <property>
        <name>fs.s3a.access.key</name>
        <value>xxxxxxxxxxxxxxxxxxxx</value>
    </property>

    <property>
        <name>fs.s3a.secret.key</name>
        <value>xxxxxxxxxxxxxxxxxxxxxxxx</value>
    </property>

    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.us-east-1.amazonaws.com</value>
    </property>

</configuration>
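
For reference, the same s3a properties can also be set directly in the storage plugin's config block instead of (or in addition to) the XML file; note that the file above is core-site-example.xml, while Drill reads conf/core-site.xml, so the endpoint setting may not be picked up unless the file is renamed and the drillbit restarted. A sketch of the plugin config variant, reusing the same Hadoop s3a property names:

```json
"config": {
  "fs.s3a.access.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "fs.s3a.secret.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com"
}
```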

But when I try to use/set the workspace using the following command:

USE shiv.`root`;

it gives me the following error:

Error: VALIDATION ERROR: Schema [shiv.root] is not valid with respect to either root schema or current default schema.

Current default schema:  No default schema selected

[Error Id: 6d9515c0-b90f-48aa-9dc5-0c660f1c06ca on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)

If I try to execute show schemas;, I get the following error:

show schemas;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: EEB438A6A0A5E667, AWS Error Code: null, AWS Error Message: Bad Request

Fragment 0:0

[Error Id: 85883537-9b4f-4057-9c90-cdaedec116a8 on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)

I am not able to understand the root cause of this issue.

Seems to me that the configuration is set wrong since it says Bad Request. Maybe step through the setup again? - tobi6
Can you add "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com" in the config section of your storage plugin definition? Also, you may rename core-site-example.xml to core-site.xml, restart the drillbit and try. - InfamousCoconut
The same configuration worked when I launched the instance in a public subnet and started the service. I was able to work with Drill. I don't know what the issue was, but it is resolved. - Shivkumar Mallesappa

1 Answer


I had a similar issue when using Apache Drill with GCS (Google Cloud Storage).

I was getting the following error when running a USE gcs.data query:

VALIDATION ERROR: Schema [gcs.data] is not valid with respect to either root schema or current default schema.

Current default schema:  No default schema selected

I ran SHOW SCHEMAS and there was no gcs.data schema.

I went ahead and created a data folder in my GCS bucket; gcs.data then showed up in SHOW SCHEMAS and the USE gcs.data query worked.

From my limited experience with Apache Drill, my understanding is that for file storage, if a workspace points to a folder that does not exist, Drill will throw this error.

GCS and S3 are both file-type storage plugins, so you may be hitting the same issue.
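
To check for this, the same sequence can be run in order: list the schemas first, and only run USE once the workspace actually shows up (gcs.data here is my workspace name; substitute your own):

```sql
SHOW SCHEMAS;   -- gcs.data appears only once the /data folder exists in the bucket
USE gcs.data;   -- succeeds after the folder has been created
```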


Here is my GCS storage config:

{
  "type": "file",
  "connection": "gs://my-gcs-bkt",
  "config": null,
  "workspaces": {
    "data": {
      "location": "/data",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    }
  },
  "enabled": true
}
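
Once the workspace resolves, a file in it can be queried with the usual Drill path syntax. A small sketch (sample.json is a hypothetical file name; replace it with an object that actually exists in your bucket's /data folder):

```sql
USE gcs.data;
SELECT * FROM `sample.json` LIMIT 10;
```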