I Sqoop two datasets from two different sources into Hive. I created a union of the two tables in Hive using
create table db.table as select * from table1 union select * from table2
I used this table in PySpark via a HiveContext to perform some analytical functions, like string indexing on one of the columns.
from pyspark.sql import HiveContext
from pyspark.ml.feature import StringIndexer

hc = HiveContext(sc)
data = hc.sql("select * from db.table")
indexer = StringIndexer(inputCol="col_cat", outputCol="cat_indexed")
indexed = indexer.fit(data).transform(data)
However, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o63.fit.
: java.io.IOException: Not a file:
So I went into HDFS with
hadoop fs -ls /hive/db/table
and I found the table there, so I don't know what the issue is. I suspect it's because I did not create an external table, but it worked last time without one.
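My hunch is that the union job may have written its output into numbered subdirectories under the table path (Hive's UNION can do this), and Spark's default non-recursive file listing then hits a directory where it expects a plain data file, raising "Not a file". A minimal sketch in plain Python, using a hypothetical local layout rather than my actual HDFS paths, of the difference between a flat table directory and a union-style one:

```python
import os
import tempfile

# Hypothetical layouts: a "flat" Hive table directory holds data files
# directly under the table path; a union-style one nests them, e.g.
#   table/1/000000_0
#   table/2/000000_0
root = tempfile.mkdtemp()

# Flat table: data files directly under the table path.
flat = os.path.join(root, "flat_table")
os.makedirs(flat)
open(os.path.join(flat, "000000_0"), "w").close()

# Union-style table: data files nested one level down.
union = os.path.join(root, "union_table")
for sub in ("1", "2"):
    os.makedirs(os.path.join(union, sub))
    open(os.path.join(union, sub, "000000_0"), "w").close()

def non_recursive_entries(path):
    """Mimic a non-recursive input listing: (name, is_file) pairs."""
    return [(e, os.path.isfile(os.path.join(path, e)))
            for e in sorted(os.listdir(path))]

print(non_recursive_entries(flat))   # only files: a reader is happy
print(non_recursive_entries(union))  # directories: "Not a file" territory
```

If hadoop fs -ls on the real table path shows subdirectories rather than files, a commonly suggested workaround (which I haven't verified on this setup) is to enable recursive input reading, e.g. setting mapred.input.dir.recursive and hive.mapred.supports.subdirectories to true on the HiveContext, or to recreate the table so the data lands directly under the table directory.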