I'm using Spark 1.3.1 on AWS EMR. I've created a Spark table using HiveContext and can see it from Hive (using "show tables"). However, when I try to query the table (SELECT ...), it throws the following error:

hdfs://IP:9000/user/hive/warehouse/tablename/part-r-00001.parquet not a SequenceFile

When I use "describe tablename", it shows:

col                     array<string>           from deserializer

"Show table" returns the table name properly.

Any idea why Hive treats the Parquet file that Spark generates as a SequenceFile, and how to resolve that? I need to query Spark tables from Hive, and via JDBC connections from RStudio or other tools.
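For context, the table is created by loading a JSON file into a DataFrame and calling saveAsTable, roughly like the sketch below (assuming Spark 1.3 APIs; the path and table name are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext
// Load JSON into a DataFrame, then persist it as a metastore table;
// in Spark 1.3 the default data source for saveAsTable is Parquet
val df = hiveContext.jsonFile("hdfs:///user/hadoop/input.json") // hypothetical path
df.saveAsTable("tablename")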

SparkSQL/Hive and Parquet combined can be finicky. How are you creating the table? Given the desire to make Spark and Hive work together, you may find github.com/awslabs/emr-bootstrap-actions/blob/master/spark/… helpful. – ChristopherB
Setting convertMetastoreParquet to false was the trick. Thanks for the information! I'm creating the Spark table using saveAsTable from a JSON file after loading it into a DataFrame. Do you know how I can access the Spark table directly using JDBC? The Spark 1.3.1 documentation says it works using HiveServer2, but it does not. Any idea? – Dipankar Pal

1 Answer

(Posting my comment as an answer.) The issue is getting Spark and Hive to use the same version of Parquet and to agree on how to access the same data.

An example can be seen at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/parquet-sparksql-hive-compatibility.md

A helpful Spark property for this is:

set spark.sql.hive.convertMetastoreParquet=false;
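If you are working from a HiveContext rather than the SQL CLI, the same property can be set programmatically; a minimal sketch, assuming Spark 1.3 APIs:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext
// With this set to false, Spark SQL reads metastore Parquet tables
// through the Hive SerDe instead of its built-in Parquet support
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")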

As for accessing Spark via JDBC, the Thrift server for Spark, which is based on HiveServer2, needs to be started and run on a different port than the existing HiveServer2 that comes with Hive. See https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

Example command to start the Thrift server for JDBC on port 10001:

/home/hadoop/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.http.port=10001 --master yarn-client
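Once the server is up, any JDBC client can connect with a HiveServer2 URL. A minimal Scala sketch, assuming the hive-jdbc driver is on the classpath and that the server is listening in the default binary transport mode at localhost:10001 (both assumptions; adjust for your cluster):

import java.sql.DriverManager

// HiveServer2 JDBC driver from the hive-jdbc artifact
Class.forName("org.apache.hive.jdbc.HiveDriver")
// Hypothetical host/port, empty user/password
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001/default", "", "")
val rs = conn.createStatement().executeQuery("SELECT * FROM tablename LIMIT 10")
while (rs.next()) println(rs.getString(1))
conn.close()

The same jdbc:hive2:// URL is what RStudio or any other JDBC tool would use to connect.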

Be sure to note that start-thriftserver.sh takes the familiar spark-submit options for resource allocation (executors, cores, memory, etc.), so set them accordingly, as in the sketch below.
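For example, with illustrative (not recommended) values for the resource flags:

/home/hadoop/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.http.port=10001 \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2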