25
votes

I know we can load a parquet file using Spark SQL and Impala, but I'm wondering if we can do the same using Hive. I have been reading many articles but I am still confused.

Simply put, I have a parquet file - say users.parquet. Now I am stuck on how to load/insert/import the data from users.parquet into Hive (obviously into a table).

Please advise or point me in the right direction if I am missing something obvious.

Creating hive table using parquet file metadata

https://phdata.io/examples-using-textfile-and-parquet-with-hive-and-impala/

4
A trick I have done with ORC files (to clone a Prod table into a Test cluster, actually): create a non-partitioned table with the exact same structure; copy the data file(s) to the directory used by the new table; voila, the table is populated. Might work with Parquet too. – Samson Scharfrichter
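
For illustration, a minimal sketch of the trick described in this comment, assuming a non-partitioned Parquet table called users (the table name and paths are made up):

    # On the source cluster: grab the DDL and copy out the data files
    hive -S -e "show create table users;" > users_ddl.sql
    hdfs dfs -copyToLocal /apps/hive/warehouse/users ./users_data

    # On the target cluster: recreate the table with the exact same structure
    # (you may need to strip the LOCATION/TBLPROPERTIES lines from the DDL),
    # then drop the Parquet files into the directory used by the new table
    hive -f users_ddl.sql
    hdfs dfs -copyFromLocal ./users_data/* /apps/hive/warehouse/users/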

4 Answers

13
votes

Get the schema of the parquet file using parquet-tools; for details, check this link: http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html

and build a table on top of the file using that schema; for details, check Create Hive table to read parquet files from parquet/avro schema
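
As a rough sketch (the column names and HDFS path below are made up, not taken from the question):

    # Print the schema stored in the Parquet file
    parquet-tools schema users.parquet
    # e.g.  message users { required binary name (UTF8); optional int32 age; }

    # Create an external Hive table with a matching schema on top of the
    # directory that holds the Parquet file(s)
    hive -e "CREATE EXTERNAL TABLE users (name string, age int)
             STORED AS PARQUET
             LOCATION '/data/users';"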

11
votes

Getting the schema is crucial, as you will have to create the table with the appropriate schema first in Hive and then point it to the parquet files.

I had a similar problem, where I had data in one VM and had to move it to another. Here is my walkthrough:

  1. Find out where the original Parquet files are (location and schema): describe formatted users; and show create table users; The latter will get you the schema right away and also point you to the HDFS location, e.g. hdfs://hostname:port/apps/hive/warehouse/users

  2. Find out about the partitioning of your table: show partitions users;

  3. Copy the table's Parquet files from HDFS to a local directory

    hdfs dfs -copyToLocal /apps/hive/warehouse/users
    
  4. Move them across to the other cluster/VM or where you want them to go

  5. Create the users table on your destination (CREATE TABLE users ...) using the same schema

    CREATE TABLE users ( name string, ... )
    PARTITIONED BY (...)
    STORED AS PARQUET;
    
  6. Now, move the Parquet files into the respective folder (if needed, first find out the location of the table you've just created)

    hdfs dfs -ls /apps/hive/warehouse/users/
    hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/
    
  7. For each partition, you'll have to point Hive to the respective subdirectory: alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231'; (you might want to do this with a bash script; see the sketch after this list)
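
For step 7, a possible bash script (assuming the table and partition layout used above; adjust names and paths as needed) could look like this:

    #!/bin/bash
    # Register every sign_up_date=... subdirectory as a partition of users
    for dir in $(hdfs dfs -ls /apps/hive/warehouse/users/ | grep 'sign_up_date=' | awk '{print $NF}'); do
      part=$(basename "$dir" | cut -d= -f2)   # e.g. 19991231
      hive -S -e "alter table users add if not exists partition (sign_up_date='$part') location '$dir';"
    done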

That worked for me, hope it helps.

6
votes

Don't know if it's a bit "hacky", but I use Zeppelin (shipped with Ambari). You can simply do the following in combination with spark2:

%spark2
import org.apache.spark.sql.SaveMode

// Read all the Parquet files from the HDFS directory
val df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet")
// Save them as a Hive table, overwriting it if it already exists
df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")

The advantage of this approach is that you can also import many parquet files even if they have different schemas.

0
votes

You can try this... Export/import works for all file formats in Hive, including parquet. This is the general concept; you can tweak it a little based on your requirements, for example loading from local or across clusters.

Note: You can hard-code the values instead of the $ variables when you execute the individual steps, and pass the "HDFS path", "Schema" and "tablename" as parameters when you run it from a script. That way you can export/import any number of tables just by passing the parameters.

  • Step1: hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';" # -- Execute from HDFS.
  • Step2: # -- The export contains both data and metadata. Zip it and scp it to the target cluster.
  • Step3: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The first import will throw an error as the table doesn't exist, but it automatically creates the table.
  • Step4: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The second import will import the data without any error, as the table is available now.
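
For example, hard-coded for a hypothetical table mydb.users and a staging path /tmp/export (both names are illustrative only):

    # Source cluster: export data + metadata to an HDFS staging directory
    hive -S -e "export table mydb.users to '/tmp/export/users';"

    # Move the exported directory to the target cluster (e.g. distcp, or zip + scp)

    # Target cluster: run the import twice - as described above, the first run
    # may error but creates the table, the second run then loads the data
    hive -S -e "import table mydb.users from '/tmp/export/users';"
    hive -S -e "import table mydb.users from '/tmp/export/users';"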

Thanks

Kumar