0
votes

In trying to write code to implement this comment: https://stackoverflow.com/a/34521857/1730064

I've written a small script that runs the following code:

from pyspark.sql import SQLContext, HiveContext
from pyspark import SparkContext

from pyspark.sql.types import *


sc = SparkContext.getOrCreate()
hivContext = HiveContext(sc)

print sc

from pyspark.mllib.random import RandomRDDs
#df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df = RandomRDDs.uniformRDD(sc, 1000, 10, 1).map(lambda x: (x, )).toDF()
print df.show()
df.registerTempTable("df")
hivContext.sql("SELECT percentile_approx(_1, 0.5) FROM df").show

However when I run the code using spark-submit I get the following output:

./spark-submit  bar.py 

...

Traceback (most recent call last): File "/root/bar.py", line 18, in hivContext.sql("SELECT percentile_approx(_1, 0.5) FROM df").show File "/usr/local/tmpusr/third-party/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql File "/usr/local/tmpusr/third-party/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in call File "/usr/local/tmp/third-party/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco pyspark.sql.utils.AnalysisException: u"missing \' at ',' near ''; line 1 pos 27" 17/12/12 18:53:15 INFO spark.SparkContext: Invoking stop() from shutdown hook

I'm having trouble figuring out if something is wrong with setting up the Hive Context or is it something with Hive I need to fix?

1

1 Answers

1
votes

The issue is the hivecontext does not recognize the column name "_1" directly, rather you should name it to something else, one way to get this right is as follows:

from pyspark.sql import SQLContext, HiveContext
from pyspark import SparkContext

from pyspark.sql.types import *


sc = SparkContext.getOrCreate()
hivContext = HiveContext(sc)

print sc

from pyspark.mllib.random import RandomRDDs
#df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df = RandomRDDs.uniformRDD(sc, 1000, 10, 1).map(lambda x: (x, )).toDF(["var1"])
print df.show()
df.registerTempTable("df")
hivContext.sql("SELECT percentile_approx(var1, 0.5) FROM df").show()