I ingest two datasets from two different sources into Hive. I created a union of the two tables in Hive using:

create table db.table as select * from table1 union select * from table2;

I used this table in PySpark via HiveContext to perform some analytical functions, such as string indexing one of the columns:

from pyspark.sql import HiveContext
from pyspark.ml.feature import StringIndexer

# Create a HiveContext on the existing SparkContext and load the Hive table
hc = HiveContext(sc)
data = hc.sql("select * from db.table")

# Index the categorical column col_cat into numeric labels
indexer = StringIndexer(inputCol="col_cat", outputCol="cat_indexed")
indexed = indexer.fit(data).transform(data)
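
A minimal check of whether the table loads at all, before the fit (this only reuses the data DataFrame defined above):

# printSchema() only resolves metadata; first() forces a read of the HDFS files,
# so it fails with the same error if the underlying files are unreadable
data.printSchema()
print(data.first())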

However, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o63.fit.
: java.io.IOException: Not a file: 

So I went into HDFS:

hadoop fs -ls /hive/db/table

and I found the table, so I don't know what the issue is here. I feel it's because I did not create an external table, but it worked last time without an external table.
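
One way to see exactly where Hive stored the table's files (a sketch reusing the hc HiveContext from above; DESCRIBE FORMATTED is standard HiveQL) is:

# The "Location" row of the output gives the table's HDFS directory,
# which can be compared against the path in the error message
for row in hc.sql("describe formatted db.table").collect():
    print(row)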

Does data.first() give you anything? - shuaiyuancn
No, it gives me the same error. - Shweta Kamble
Have you tried to load the table in Spark? - shuaiyuancn
Also, have you configured Spark to use hive-site.xml? - shuaiyuancn
I haven't tried loading the table from Spark; I directly created the union in Hive and accessed it through HiveContext in Spark. But now I think the table is not loading at all, as data.show() is also giving an error. - Shweta Kamble

1 Answer


OK, so I found a fix: I moved the files out of the extra directories, i.e. from

/hive/db/table/file

to

/hive/db/file

by doing

hadoop fs -mv /hive/db/table/file /hive/db/file

and now it works. The problem was that the union in Hive wrote each input table's rows into its own subdirectory under the table location, so the table directory contained directories rather than plain files. When Spark tried to read the table, it hit those subdirectories and failed with "Not a file". So I moved the files to the location Spark was pointing at.
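
For anyone who would rather not move files: an alternative that I believe works (untested here; both keys are standard Hadoop/Hive configuration settings, not anything specific to this cluster) is to make the HiveContext read input directories recursively, so the subdirectories created by the union are not a problem:

from pyspark.sql import HiveContext

hc = HiveContext(sc)

# Tell the underlying Hadoop input format to descend into subdirectories
# instead of raising "Not a file" when it encounters one
hc.setConf("mapred.input.dir.recursive", "true")
hc.setConf("hive.mapred.supports.subdirectories", "true")

data = hc.sql("select * from db.table")

With these set, Hive can keep its per-union subdirectories and Spark should still find the files.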