1
votes

I am having an issue with Hive external table creation with Avro data / schema:

Steps followed:

  1. Imported data from MySQL into HDFS as Avro using Sqoop (a sketch of the commands used is shown after this list).
  2. Transferred the .avsc file from local to HDFS (opened the file; the schema is as expected and fine).
  3. Verified the data is present in HDFS as a result of the Sqoop import.
  4. Created an external table pointing the schema to the file from step 2 and the data location to the directory from step 3.
  5. The Hive command line reports OK, table created. SHOW TABLES displays the table, and the file location verified from Hue is all fine.
  6. When querying the table from the Hive command line, I get an error:

    java.io.IOException:java.io.IOException: Not a data file.
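For reference, the import and schema transfer in steps 1 and 2 would have looked roughly like the sketch below; the connection string, credentials and database name are assumptions (based on the Cloudera quickstart VM defaults), so adjust them to your environment:

$ sqoop import \
    --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
    --username retail_dba -P \
    --table departments \
    --as-avrodatafile \
    --target-dir /user/cloudera/sqoopAvro

$ hdfs dfs -put departments.avsc /user/cloudera/departments.avsc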

hive> create external table departmentsAvro2
      row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      stored as inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      location 'hdfs://quickstart.cloudera/user/cloudera/sqoopAvro'
      tblproperties ('avro.schema.url'='hdfs://quickstart.cloudera/user/cloudera/departments.avsc');

Output:

OK
Time taken: 0.092 seconds

hive> show tables;

Output:

OK
departmentsavro2
order_items
orders
Time taken: 0.016 seconds, Fetched: 12 row(s)

hive> select * from departmentsavro2;

Output:

OK
Failed with exception java.io.IOException:java.io.IOException: Not a data file.
Time taken: 0.145 seconds

As suggested in some threads, I granted all the necessary RWX permissions on the .avsc and data files in HDFS.

Any pointers?

2
Verified from the Hive command line that both the data and the schema exist; not sure what's wrong, it only fails when querying the table:

hive> dfs -cat hdfs://quickstart.cloudera/user/cloudera/sqoopAvro/part*;
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
10000,apparel

- Ramesh devarakonda

hive> dfs -cat hdfs://quickstart.cloudera/user/cloudera/departments.avsc;
{
  "type" : "record",
  "name" : "departments",
  "doc" : "Sqoop import of departments",
  "fields" : [ {
    "name" : "department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "department_id",
    "sqlType" : "4"
  }, {
    "name" : "department_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "department_name",
    "sqlType" : "12"
  } ],
  "tableName" : "departments"
}

- Ramesh devarakonda
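A quick way to confirm whether the part files are really Avro containers is to look at their first bytes; a genuine Avro data file starts with the magic bytes "Obj" (followed by an unprintable version byte) rather than readable rows. Using the same path as in the comment above:

$ hdfs dfs -cat hdfs://quickstart.cloudera/user/cloudera/sqoopAvro/part* | head -c 4
Obj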

2 Answers

4
votes

I had been working on this issue for a couple of days as well... and finally found out that I was storing the AVSC file in the same directory as the AVRO files. For some reason, this causes the issue. The fix is:

  1. Create different directories for avro and avsc files:

$ hadoop fs -mkdir /user/hduser/idl

$ hadoop fs -mkdir /user/hduser/data

  2. Move each file to its place:

$ hadoop fs -mv /user/hduser/avrofile.avsc /user/hduser/idl

$ hadoop fs -mv /user/hduser/avrofile.avro /user/hduser/data

  3. Re-create the table:

DROP TABLE external_avro_table;

CREATE EXTERNAL TABLE external_avro_table
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 STORED AS INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
 LOCATION 'hdfs://namenode:54310/user/hduser/data'
 TBLPROPERTIES ('avro.schema.url'='hdfs://namenode:54310/user/hduser/idl/avrofile.avsc');

  4. Do not mix anything else in with your Avro files. Hive will try to read anything in the table's location as Avro.
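To double-check, listing both directories should show only the expected files (these are the example paths used above):

$ hadoop fs -ls /user/hduser/data

$ hadoop fs -ls /user/hduser/idl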

Hope this helps.

0
votes

Are you using the same folder where you imported the data with Sqoop?

If yes: like any MapReduce job, Sqoop also creates a _SUCCESS file in the output folder, and the AvroSerDe cannot read it. Remove the _SUCCESS file from the folder your Hive CREATE statement points to, and the SELECT query will work fine.
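A minimal sketch of that cleanup, assuming the import directory from the question:

$ hadoop fs -rm /user/cloudera/sqoopAvro/_SUCCESS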