I am using Spark to query Hive and then run transformations. My Scala app creates multiple Spark applications. A new Spark application is created only after closing the SparkSession and SparkContext of the previous one.
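For reference, the create/stop pattern looks roughly like this (a minimal sketch; the app names and table name are placeholders and the real transformation logic is omitted):

    import org.apache.spark.sql.SparkSession

    def runJob(appName: String): Unit = {
      // Build a fresh SparkSession (and SparkContext) with Hive support for this application
      val spark = SparkSession.builder()
        .appName(appName)
        .enableHiveSupport()
        .getOrCreate()

      // Query Hive and run the transformations (placeholder table name)
      val df = spark.sql("SELECT * FROM some_hive_table")
      df.count()

      // Stop the session, which also stops the underlying SparkContext
      spark.stop()
    }

    // Applications run strictly one after another, never concurrently
    Seq("app-1", "app-2", "app-3").foreach(runJob)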

However, after stopping sc and spark, the connections to the Hive Metastore database (MySQL) are somehow not closed properly. For every Spark application I can see around 5 new MySQL connections being created, while the old connections remain active. Eventually MySQL starts rejecting new connections once 150 are open. How can I force Spark to close the Hive Metastore connections to MySQL after spark.stop() and sc.stop()?

Note: I am on Spark 2.1.1 and use Spark's Thrift server instead of HiveServer2, so I don't think I am running a standalone Hive Metastore service.

1. You can raise that 150-connection limit in MySQL. 2. Spark does not connect to MySQL directly; it connects to the Metastore service, which connects to its relational DB through a connection pool (either BoneCP or DBCP, if I remember well). 3. There is a known bug about connection leaks (hence memory leaks) in the Metastore code when used with BoneCP and MySQL that was only fixed recently, so check your Hive version and Hive setup. - Samson Scharfrichter
I can raise the 150-connection limit, but I don't think leaving connections open is a good approach; I want to fix it properly. - rohitsd
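Following up on the connection-pool point in the comment above: the Metastore's pool type and size can be tuned in hive-site.xml. A hedged sketch only; the property names are DataNucleus/Hive settings whose accepted values vary by Hive version, so verify them against your installation before relying on them:

    <!-- hive-site.xml: metastore connection pool settings (verify for your Hive version) -->
    <property>
      <name>datanucleus.connectionPoolingType</name>
      <!-- BONECP is the older default; DBCP or HikariCP (newer Hive releases) are alternatives -->
      <value>DBCP</value>
    </property>
    <property>
      <name>datanucleus.connectionPool.maxPoolSize</name>
      <!-- cap the number of metastore-to-MySQL connections held by each client -->
      <value>5</value>
    </property>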

1 Answer


I have had a similar problem with a Hive 3.1.1 metastore backed by MySQL. I'm using the wait_timeout variable to reap connections that have been inactive for more than 10 minutes; the default is 8 hours.

https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_wait_timeout
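For the record, this is roughly how the setting is applied (assuming MySQL 5.7/8.0; 600 seconds is just the 10-minute figure mentioned above, not a recommended value):

    -- MySQL: drop connections that have been idle for more than 10 minutes (600 s)
    SET GLOBAL wait_timeout = 600;
    SHOW GLOBAL VARIABLES LIKE 'wait_timeout';

To survive a server restart, the same value can also be set under the [mysqld] section of my.cnf.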

This doesn't feel like a 'proper' solution, but it allows our system to function.