2 votes

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:

df.write.mode('overwrite').saveAsTable("db.table")

hive> describe table;
OK
val           string
Time taken: 0.021 seconds, Fetched: 1 row(s)

However, if I specify that the format should be csv:

df.write.mode('overwrite').format('csv').saveAsTable("db.table")

then I can save the table, but Hive doesn't recognize the schema:

hive> describe table;
OK
col                     array<string>           from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)

It's also worth noting that I can create a Hive table manually and then insertInto it:

spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")

When I do that, Hive recognizes the schema. But that's clunky, and I can't figure out a clean way to automate generating the schema string anyway.
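The closest I've come is building the string from df.schema, roughly like the untested sketch below (I'm assuming the type names from simpleString() line up with Hive's, which I haven't verified for every type):

# Untested sketch: derive the Hive column list from the DataFrame schema.
cols = ", ".join("{} {}".format(f.name, f.dataType.simpleString())
                 for f in df.schema.fields)
spark.sql("create table db.table ({})".format(cols))
df.write.mode("overwrite").insertInto("db.table")

But that feels fragile, and I'd rather have saveAsTable just work.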

What is the schema of the Dataframe? Are there commas in the data? - OneCricketeer
What is the output of df.printSchema? - Ankush Singh
Why "clunky"? You want to create a Hive-compliant dataset, so it seems reasonable to use a Hive-compliant method to create it. And for the record, it's even worse with Parquet, because by default Spark does not use the same binary encoding as Hive... and the way to override that default is not documented cf. [SPARK-20937] - Samson Scharfrichter
@cricket_007 and Ankush Singh: The schema and data are trivial, and I've tried various ones. Even a DataFrame with a single integer column will fail to write a schema that Hive can read (using the csv format). - santon
@samsonScharfrichter I say "clunky" because you have to write a Hive-compliant schema string. When using saveAsTable with the default format, you don't have to do that. I hadn't run into the binary encoding issues with Parquet. Thanks for the heads up! - santon

3 Answers

1 vote

That is because Hive's default SerDe does not support csv.

If you insist on using the csv format, create the table as below:

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
)  
STORED AS TEXTFILE;

Then insert data through df.write.insertInto.
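From PySpark that might look like the following sketch (not from the original answer; I've used a comma separator to match csv rather than the tab in the DDL above, and assumed the table columns line up with the DataFrame):

spark.sql("""
    CREATE TABLE IF NOT EXISTS db.table (val string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "'")
    STORED AS TEXTFILE
""")
df.write.mode('overwrite').insertInto("db.table")

One caveat from the linked wiki page: the OpenCSVSerde treats all columns as strings on the Hive side.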

For more info:

https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

0 votes

You are creating a table with text format and trying to insert CSV data into it, which may run into problems. So, as suggested in the answer by Zhang Tong, create the Hive table using the Hive OpenCSVSerde.

After that, if you are more comfortable with Hive query language than dataframes, you can try this.

df.registerTempTable("temp")
spark.sql("insert overwrite table db.table select * from temp")
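Note: registerTempTable is deprecated since Spark 2.0; the drop-in replacement is df.createOrReplaceTempView("temp").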
0 votes

This happens because the Hive SerDe used for csv is different from the one Spark uses. Hive defaults to the TEXTFILE format, and the delimiter has to be specified when creating the table.
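For instance (my sketch, not part of the original answer), a plain-text table with an explicit delimiter could be declared like this:

spark.sql("""
    create table t1 (a int, b int)
    row format delimited fields terminated by ','
    stored as textfile
""")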

One option is to use the insertInto API instead of saveAsTable when writing from Spark. With insertInto, Spark writes the contents of the DataFrame to the specified table. It requires the schema of the DataFrame to match the schema of the table, and the position of the columns matters because insertInto ignores the column names.

Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
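The one-liner above is Scala; in PySpark, an equivalent sketch (assuming the t1 table created above) would be:

# The column names "b" and "a" are ignored by insertInto;
# 5 lands in t1's first column and 6 in its second, by position.
df = spark.createDataFrame([(5, 6)], ["b", "a"])
df.write.insertInto("t1")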