Write files inside Hive table hdfs folder and make them available to be queried from Hive

Question

I am using Spark 2.2.1, which has a useful option to cap the number of records written to each output file; this avoids having to repartition before writing. However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one. Written this way, the option is ignored:

df.write.mode("overwrite")
  .option("maxRecordsPerFile", 10000)
  .insertInto(hive_table)

while written this way it works:

df.write.option("maxRecordsPerFile", 10000)
  .mode("overwrite").orc(path_hive_table)

so I am writing ORC files directly into the HDFS folder of the specified Hive table. The problem is that when I query the Hive table after the insertion, Hive does not recognize this data. Is there a way to write partition files directly into the Hive table's folder and also make them available through the Hive table?

Sandeep Das · Accepted Answer · 2018-06-06T11:18:02

Debug steps:

1. Check the file format your Hive table uses:

SHOW CREATE TABLE table_name;

and look at the "STORED AS" clause. For better efficiency, save your output as Parquet, in the partition location (shown under "LOCATION" in the output of the query above). If the table uses another specific format, write the files in that format.
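For this first step, a small helper can pull the two relevant values out of the DDL text. This is only a sketch: `storage_info` is a hypothetical name, the regular expressions assume the usual `STORED AS ...` and `LOCATION '...'` layout, and the commented usage assumes an existing Hive-enabled SparkSession called `spark`.

```python
import re


def storage_info(ddl):
    """Return (format, location) parsed from a CREATE TABLE statement.

    Either value is None when the corresponding clause is not found.
    """
    # Short form: STORED AS ORC
    # Long form:  STORED AS INPUTFORMAT 'some.input.FormatClass' ...
    stored = re.search(
        r"STORED\s+AS\s+(?:INPUTFORMAT\s+'([^']+)'|(\w+))", ddl, re.IGNORECASE
    )
    location = re.search(r"LOCATION\s+'([^']+)'", ddl, re.IGNORECASE)
    fmt = (stored.group(1) or stored.group(2)) if stored else None
    path = location.group(1) if location else None
    return fmt, path


# Usage sketch (assumes `spark` is a SparkSession with Hive support):
# ddl = spark.sql("SHOW CREATE TABLE table_name").first()[0]
# fmt, path = storage_info(ddl)
```

If the returned format does not match what the Spark job writes (for example ORC files under a Parquet table), Hive will generally not be able to read the new files.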

2. If you are saving data into a partition, avoid creating the partition folder manually. Create the partition with:

alter table {table_name} add partition ({partition_column}={value});
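One way to do this from the Spark job itself is to write the files into the partition's directory and then issue the ALTER TABLE statement through `spark.sql`. The sketch below is not from the answer: `add_partition_sql` and `register_partition` are hypothetical helper names, `spark` is assumed to be a Hive-enabled SparkSession, and partition values are quoted as strings without any escaping.

```python
def add_partition_sql(table, partition_spec, location=None):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    partition_spec maps partition column names to values, e.g. {"dt": "2018-06-06"}.
    Values are quoted as plain strings; no escaping is attempted.
    """
    spec = ", ".join(f"{col}='{val}'" for col, val in partition_spec.items())
    stmt = f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec})"
    if location is not None:
        stmt += f" LOCATION '{location}'"
    return stmt


def register_partition(spark, table, partition_spec, location=None):
    """Run the statement through spark.sql and return it for logging."""
    stmt = add_partition_sql(table, partition_spec, location)
    spark.sql(stmt)
    return stmt


# Usage sketch (table name, partition value and partition_path are placeholders):
# df.write.option("maxRecordsPerFile", 10000).mode("overwrite").orc(partition_path)
# register_partition(spark, "db.events", {"dt": "2018-06-06"}, partition_path)
```

If many partition directories were written at once, Hive's `MSCK REPAIR TABLE table_name` is another option for registering them in bulk, provided the directories follow the `column=value` naming.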

3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (print the DataFrame and look for this column).
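A rough sketch of this reload check, assuming a SparkSession named `spark`; `check_written_files` is a hypothetical helper name. Note that `_corrupt_record` is the column Spark adds by default for unparseable rows in text-based sources such as JSON. With ORC or Parquet, a damaged file is more likely to surface as a read error, so the row count is the more useful signal there.

```python
def check_written_files(spark, path, fmt="orc"):
    """Reload the files under `path` and return (row_count, has_corrupt_column).

    `spark` is assumed to be an existing SparkSession; `fmt` should match
    the format the files were written in.
    """
    df = spark.read.format(fmt).load(path)
    # Present only when the reader produced a corrupt-record column
    has_corrupt_column = "_corrupt_record" in df.columns
    return df.count(), has_corrupt_column


# Usage sketch (path_hive_table is the directory the job wrote to):
# rows, corrupt = check_written_files(spark, path_hive_table, fmt="orc")
# print(rows, corrupt)
```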
