I am planning to save a Spark DataFrame into Hive tables so I can query them and extract the latitude and longitude, since Spark DataFrames aren't iterable.
Using PySpark in Jupyter, I wrote this code to create a Spark session:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# read a CSV file with PySpark
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.sql.catalogImplementation=hive").enableHiveSupport() \
    .getOrCreate()
df = spark.read.csv("Desktop/train/train.csv", header=True)
Pickup_locations = df.select("pickup_datetime", "Pickup_latitude",
                             "Pickup_longitude")
print(Pickup_locations.count())
Then I run the HiveQL:
df.createOrReplaceTempView("mytempTable")
spark.sql("create table hive_table as select * from mytempTable")
And I get this error:
Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `hive_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- Project [id#311, vendor_id#312, pickup_datetime#313, dropoff_datetime#314, passenger_count#315, pickup_longitude#316, pickup_latitude#317, dropoff_longitude#318, dropoff_latitude#319, store_and_fwd_flag#320, trip_duration#321]
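
For comparison, here is a minimal sketch of how the session could be built with the catalog setting passed as a separate key and value, since `config()` takes them as two arguments rather than a single `key=value` string. This is a guess at a contributing cause, not a confirmed fix. The helper names `build_hive_session` and `hive_enabled` are made up for illustration, and the sketch assumes a PySpark installation that ships with Hive support. Note also that `getOrCreate()` returns an already running session if one exists in the notebook, so the kernel may need a restart before the setting takes effect.

```python
# Sketch only: assumes pyspark is installed and was built with Hive support.
HIVE_CATALOG_KEY = "spark.sql.catalogImplementation"


def build_hive_session(app_name="Python Spark SQL basic example"):
    """Hypothetical helper: create (or fetch) a session that requests the Hive catalog."""
    # Imported lazily so this module can be loaded without Spark present.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Key and value are separate arguments; enableHiveSupport() sets the
        # same key, so this line is redundant but makes the intent explicit.
        .config(HIVE_CATALOG_KEY, "hive")
        .enableHiveSupport()
        .getOrCreate()  # reuses an existing session if one is already running
    )


def hive_enabled(spark):
    """Hypothetical helper: report whether the session really uses the Hive catalog."""
    return spark.conf.get(HIVE_CATALOG_KEY) == "hive"
```

Calling `hive_enabled(build_hive_session())` before running the `create table ... as select` statement would show whether the session actually picked up Hive; if it returns `False`, the session was most likely created earlier without Hive support.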