How to query an HDFS part file from Apache Drill

Question

I am trying to query my HDFS file system from Apache Drill. I am able to query Hive tables and CSV files successfully, but queries against part files do not work.

hadoop fs -cat BANK_FINAL/2015-11-02/part-r-00000 | head -1

Gives result:

028|S80306432|2015-11-02|BRN-CLG-CHQ PAID TO SILVER ROCK BANDRA CO-OP|485|ZONE SERIAL [ 485]|L|I|MAHARASHTRA STATE CO-OP BANK LTD|3320.0|INWARD CLG|D11528|SBPRM

select * from dfs.`/user/ituser1/e.csv` limit 10

works fine and returns results.

But when I run this query

select * from dfs.`/user/ituser1/BANK_FINAL/2015-11-02/part-r-00000` limit 10

it fails with this error:

org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs./user/ituser1/BANK_FINAL/2015-11-02/part-r-00000' not found [Error Id: 6f80392a-51af-4b61-94d8-335b33b0048c on genome-dev13.axs:31010]

Apache Drill dfs storage plugin json is as follows:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://10.9.1.33:8020/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "psv"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}

adeneche · Accepted Answer · 2016-03-19T07:44:44

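Drill uses the file extension to determine the file type, except for Parquet files, where it tries to read a magic number from the file. Your part files have no extension, so Drill cannot match them to any configured format and reports the table as not found. In your case, you need to set "defaultInputFormat" on the workspace to tell Drill which format to assume for files without an extension. Your sample row is pipe-delimited, so "psv" is likely a better fit than "csv". You can find more information here: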

https://drill.apache.org/docs/drill-default-input-format/
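
As a rough sketch of what that could look like, the "workspaces" section of the dfs plugin above might gain an extra workspace whose default format is "psv". The workspace name "bank" and the "writable": false setting are assumptions made for illustration; the location is taken from the path in the question, and "psv" refers to the pipe-delimited format already defined under "formats". Only the "workspaces" part is shown; the rest of the plugin config would stay as it is.

```json
{
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    },
    "bank": {
      "location": "/user/ituser1/BANK_FINAL",
      "writable": false,
      "defaultInputFormat": "psv"
    }
  }
}
```

With a workspace like this, the query would be written relative to the workspace location, for example select * from dfs.bank.`2015-11-02/part-r-00000` limit 10. Because the "psv" format does not extract a header, each row should come back as a single columns array, so individual fields are addressed as columns[0], columns[1], and so on.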
