
I am analyzing the memory usage of our Spark application. We use Hive and PySpark.

Our application runs many Spark SQL queries like the ones below. While they run, our Hive Metastore server comes under heavy pressure and runs out of memory.

The disk storage used by claim_temp explodes, although I cannot find any cache() statement anywhere: we just select some columns and insert the result. (claim_temp is about 300 GB and will grow to 1000 GB.)

    SQL4 = """
        create temp view EX as
        select a.* from {0} a
        inner join {1} b
        on a.specialty = b.code
        where classification = 'ABCD'
    """.format(self.tables['Claims'], self.tables['taxonomy'])
    self.spark.sql(SQL4)

    self.spark.sql("""insert into {0}.results_flagged
        select * from EX""".format())

Does the create temp view statement add data to Hive Metastore?

Is create temp view Hive SQL that gets treated as a temp table in Hive, or is it just equivalent to createOrReplaceTempView, which does not add anything to the metastore?


1 Answer


A temp view will not persist to the metastore. It is an object tied to the Spark session/application and is dropped when the application ends. Details here => https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-view.html