0
votes

How to load text file into Hive orc external table?

create table MyDB.TEST (
 Col1 String,
 Col2 String,
 Col3 String,
 Col4 String)
 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

I have already created above table as Orc. but while fetching data from table it show below error Failed with exception

java.io.IOException:org.apache.orc.FileFormatException: Malformed ORC file hdfs://localhost:9000/Ext/sqooporc/part-m-00000. Invalid postscript.

1
Not sure why you tagged sqoop... If that's where the text is coming from, you should specify thatOneCricketeer
Please accept the answer if that solved your problem :)Rahul

1 Answers

2
votes

There are multiple steps to that. Follows the details.

  1. Create a hive table which is able to read from the plain text file. Assuming that your file is a comma delimited file and your file is on HDFS on a location called /user/data/file1.txt, follows will be the syntax.

    create table MyDB.TEST (
      Col1 String,
      Col2 String,
      Col3 String,
      Col4 String
    )
    row format delimited
    fields terminated by ','
    location '/user/data/file1.txt';
    

Now you have a schema which is in sync with the format of the data you posses.

  1. Create another table with ORC schema

Now you need to create the ORC table as you were creating earlier. Here is a simpler syntax for creating that table.

create table MyDB.TEST_ORC (
  Col1 String,
  Col2 String,
  Col3 String,
  Col4 String)
STORED AS ORC;
  1. Your TEST_ORC table is an empty table now. You can populate this table using the data from TEST table using the following command.

    INSERT OVERWRITE TABLE TEST_ORC SELECT * FROM TEST;
    

The aforementioned statement will select all the records from TEST table and will try to write those records to TEST_ORC table. Since TEST_ORC is an ORC table, the data will be converted to ORC format on the fly when written into the table.

You can even check the storage location of TEST_ORC table for ORC files.

Now your data is in ORC format and your table TEST_ORC has the required schema to parse it. You may drop your TEST table now, if not needed.

Hope that helps!