
I am trying to understand the Spark DataFrame API method saveAsTable.

I have the following questions:

  • If I simply write a DataFrame using the saveAsTable API, e.g. df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table that can be read outside Spark using HiveQL? (A minimal sketch of what I am running follows this list.)
  • Does Spark also create non-Hive tables via the saveAsTable API, i.e. tables that cannot be read outside Spark using HiveQL?
  • How can I check whether a table is a Hive table or a non-Hive table?
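
For context, roughly what I am running; the input path and session setup below are placeholders for my actual job:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("saveAsTable-test")
  .enableHiveSupport() // required for persistent metastore tables
  .getOrCreate()

// Placeholder input path for my actual data
val df7 = spark.read.parquet("/data/people.parquet")
df7.write.saveAsTable("t1") // t1 did not exist before this call
```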

(I am new to big data processing, so pardon me if the question is not phrased properly.)


2 Answers


Yes. The newly created table will be a Hive table and can be queried from the Hive CLI, but only if the DataFrame was created from a single, non-partitioned input HDFS path.

Below is the documentation comment from the DataFrameWriter.scala class (documentation link):

When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
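
Based on that note, here is a hedged sketch of how you might write in a Hive-compatible format and then inspect the result; t1 and df7 are from the question, and the exact metadata fields vary by Spark version:

```scala
// Writing with an explicit Parquet format maps to a builtin Hive
// SerDe, so (for a single, non-partitioned input path) the table
// should be readable from Hive:
df7.write.format("parquet").saveAsTable("t1")

// Inspect the table metadata. A Hive-compatible table shows Hive
// SerDe/InputFormat classes; a Spark-only table instead carries a
// spark.sql.sources.provider property with a generic SerDe:
spark.sql("DESCRIBE FORMATTED t1").show(100, truncate = false)
```

This also answers the third question: DESCRIBE FORMATTED (or DESCRIBE EXTENDED) is a practical way to tell a Hive table from a Spark-specific one.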


Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (Spark's bucketing layout is not compatible with Hive's).
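
A minimal sketch, assuming df7 has a "country" column (the column name is illustrative):

```scala
// Partitioning by a column keeps the table readable from Hive:
df7.write
  .format("parquet")
  .partitionBy("country")
  .saveAsTable("t1_partitioned")

// Bucketing, by contrast, uses a Spark-specific layout that Hive
// cannot read, so the resulting table is not Hive-compatible:
// df7.write.bucketBy(8, "country").saveAsTable("t1_bucketed")
```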