pyspark, how to read Hive tables with SQLContext?

Question

I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).

I have some existing Hive tables, and I can run SQL queries on them from the HUE web interface with Hive (MapReduce) and Impala (MPP).

I am now using pySpark (I think pyspark-shell is behind this) and I want to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two for various versions of Spark.

With HiveContext, I have no issue querying the Hive tables:

from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc) 
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320

So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work:

from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc) 
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()

Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;

I added the hive-site.xml to pyspark-shell. When running

sc._conf.getAll()

I see:

('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),

My questions are:

Can I access Hive tables with SQLContext for simple queries? (I know HiveContext is more powerful, but for me this is just to understand things.)
If this is possible, what is missing? I couldn't find any info apart from the hive-site.xml, which I tried but which doesn't seem to work.

Thanks a lot

Cheers

Fabien

HiveContext is an instance of the Spark SQL execution engine and not the other way around — eliasah
I meant to say this hivecontext is an extension of sqlcontext. The answer given is correct. — eliasah

Chitral Verma · Accepted Answer · 2017-06-12T01:54:54

As mentioned in the other answer, you can't use SQLContext to access Hive tables. Spark 1.x.x provides a separate HiveContext for that, which is basically an extension of SQLContext.

Reason:

Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to live in MySQL etc.; the default is Derby. This is done so that all users accessing Hive see the same contents, as served by the metastore. Derby creates a private metastore as a directory metastore_db in the directory from which the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically works against a private catalog of this kind, which is why it does not find tables that live in the shared Hive metastore.
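
One way to check this difference on your own cluster is to compare which table names each context reports. The following is only a minimal sketch, assuming a Spark 1.x pyspark shell where sc already exists. compare_visible_tables is a hypothetical helper name, not part of Spark; it relies only on the tableNames method that both SQLContext and HiveContext provide.

```python
def compare_visible_tables(plain_ctx, hive_ctx, db_name=None):
    """Split the table names reported by hive_ctx by whether plain_ctx reports them too.

    Hypothetical helper: both arguments only need a tableNames(dbName)
    method, which SQLContext and HiveContext provide in Spark 1.x.
    """
    plain = set(plain_ctx.tableNames(db_name))
    hive = set(hive_ctx.tableNames(db_name))
    only_hive = sorted(hive - plain)   # reported by hive_ctx only
    both = sorted(hive & plain)        # reported by both contexts
    return only_hive, both


# Usage in the pyspark shell (commented out because it needs a running Spark 1.x):
# from pyspark.sql import SQLContext, HiveContext
# only_hive, both = compare_visible_tables(SQLContext(sc), HiveContext(sc), "table")
# print(only_hive)
```

If the explanation above applies to your setup, mytable should appear in the first list and not in the second.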

In Spark 2.x.x the two were merged into SparkSession, which acts as a single entry point to Spark. You can enable Hive support while creating the SparkSession by calling .enableHiveSupport().
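
A minimal sketch of that Spark 2.x route, assuming pyspark 2.x is installed and hive-site.xml is visible to Spark. The helper names hive_session and count_rows and the app name are made up for illustration; table.mytable is the table from the question.

```python
def hive_session(app_name="hive-read-sketch"):
    """Build (or reuse) a SparkSession with Hive support enabled."""
    # Imported inside the function so this file can be loaded without pyspark.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName(app_name)          # hypothetical app name
            .enableHiveSupport()        # resolve tables through the Hive metastore
            .getOrCreate())


def count_rows(spark, table_name):
    """Count the rows of a table that the given session can resolve."""
    return spark.table(table_name).count()


# Usage (commented out because it needs a Spark 2.x installation):
# spark = hive_session()
# print(count_rows(spark, "table.mytable"))
```

Without the .enableHiveSupport() call, the session would not be expected to look tables up in the shared Hive metastore, which mirrors the SQLContext behaviour described above.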
