I've been trying to get an accurate picture of how Spark's catalog API stores its metadata.
I have found some resources, but no answer:
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Catalog.html
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogImpl.html
- https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Catalog.html
I see some tutorials that take the existence of a Hive Metastore for granted.
- Is a Hive Metastore potentially included with the Spark distribution?
- A Spark cluster can be short-lived, but a Hive Metastore would obviously need to be long-lived.
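From what I can tell, the choice of catalog backend is controlled by configuration. A minimal sketch of what I mean (the warehouse path and JDBC URL here are placeholder assumptions, not values from any real deployment):

```
# spark-defaults.conf (sketch)
# "in-memory" = session-scoped catalog, lost when the session ends;
# "hive" = persist metadata in a Hive Metastore
spark.sql.catalogImplementation  hive
spark.sql.warehouse.dir          /path/to/warehouse

# Without further config, Hive support uses an embedded Derby DB in the
# working directory; pointing it at an external DB (assumed host/URL below)
# is what makes the metastore outlive the cluster:
spark.hadoop.javax.jdo.option.ConnectionURL  jdbc:mysql://metastore-host:3306/metastore
```

If that reading is right, the metastore *service* itself is not bundled with Spark, only the client libraries and an embedded Derby fallback, which would explain why it is usually treated as a separate long-lived component.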
Apart from the catalog feature, the bucketing and sorting options when writing out a DataFrame (`bucketBy`/`sortBy`) also seem to depend on Hive... So "everyone" seems to take Hive for granted when talking about Spark's key features for persisting a DataFrame.
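To make the dependency concrete, here is a sketch of the kind of write I mean (table and column names are made up). `bucketBy`/`sortBy` only work with `saveAsTable`, which records the table in the session catalog, so without Hive support the metadata lives only in the in-memory catalog:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("catalog-demo")
  .enableHiveSupport() // requires the spark-hive module on the classpath
  .getOrCreate()

val df = spark.range(100).withColumn("bucket_key", col("id") % 10)

df.write
  .bucketBy(4, "bucket_key")
  .sortBy("id")
  .saveAsTable("demo_bucketed") // plain .save() rejects bucketBy/sortBy
```

So my understanding is that the bucketing metadata has nowhere durable to go unless a Hive Metastore is configured, which is the part I'd like confirmed.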