
I am using the Airflow HiveServer2 hook to get results from a Hive table and load them into a CSV file. The hook's to_csv function has a parameter 'output_headers'. If set to True, it writes the column names in the form tablename.columnname along with the data to the CSV file. In the CSV header I just need the column names, without the tablename prefix. Can I override the parameter somehow to get just the column names? Is there any other way to retrieve only the column names using HiveServer2Hook?

I have connected to Hive using HiveServer2Hook and executed the hook's to_csv function. I just need to change the format of the column names it returns. Here is the link to the hook; you can find the to_csv, get_records and get_results functions under HiveServer2Hook.

https://airflow.apache.org/_modules/airflow/hooks/hive_hooks.html

I also tried running 'describe tablename' and 'show columns from tablename' as the HQL, but the hook's get_records and get_results functions break on a header issue, because the result returned by 'describe' and 'show columns' is not in the expected format.

I tried the following:

1) describe tablename;
2) show columns from tablename;

The Airflow hook has the functions get_records and get_results. Both break on the following line when I use the above HQL statements.

 header = next(results_iter)

Is there any other way to get column names, write to CSV and pull data using HiveServer2Hook and Python?
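(As a manual fallback, the prefix could presumably be stripped from the header strings in plain Python; a minimal sketch, assuming the headers come back as 'tablename.columnname' strings:)

```python
def strip_table_prefix(header_names):
    """Drop the leading 'tablename.' from each header, leaving plain names as-is."""
    return [name.split(".", 1)[-1] for name in header_names]

print(strip_table_prefix(["my_table.id", "my_table.name", "ds"]))
# → ['id', 'name', 'ds']
```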


3 Answers

1 vote

I ran into the same problem; here is an easier way to do it.

Pass the following hive_conf parameter to the to_csv(..) method:

hive_conf={"hive.resultset.use.unique.column.names": "false"}

This suppresses the tablename prefix before each column name.
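A minimal sketch of the call (the connection id, database, and table names are placeholders, and the assumption is that to_csv forwards hive_conf to the Hive session):

```python
# Setting this Hive property makes the result set use plain column names
# instead of tablename.columnname, so the CSV header comes out clean.
hive_conf = {"hive.resultset.use.unique.column.names": "false"}

# With a live Airflow connection this would be passed straight through, e.g.:
#
#   from airflow.hooks.hive_hooks import HiveServer2Hook
#   hook = HiveServer2Hook(hiveserver2_conn_id="my_hiveserver2_conn")  # placeholder
#   hook.to_csv(hql="SELECT * FROM my_db.my_table",
#               csv_filepath="/tmp/my_table.csv",
#               hive_conf=hive_conf)
```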

0 votes

Use the HiveMetastoreHook's get_table(..) function to get the exact column names, as follows:

# imports
from airflow.hooks.hive_hooks import HiveMetastoreHook
from hmsclient.genthrift.hive_metastore import ttypes
from typing import List

# create hook
hive_metastore_hook: HiveMetastoreHook = HiveMetastoreHook(metastore_conn_id="my-hive-metastore-conn-id")

# fetch table object
table: ttypes.Table = hive_metastore_hook.get_table(table_name="my_table_name", db="my_db_name")

# determine column names from the table's storage descriptor
column_names: List[str] = [field_schema.name for field_schema in table.sd.cols]


After this you must subclass HiveServer2Hook to modify the to_csv(..) method. In particular, changing the header value to the column_names extracted above should suffice.


Alternatively, if you do not wish to subclass HiveServer2Hook, you can implement its to_csv(..) separately (for example in a hive_utils.py file) and achieve the same behaviour.
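A minimal standalone sketch of that idea (not the hook's actual implementation; rows would come from get_records(hql) and column_names from the metastore lookup above):

```python
import csv

def to_csv_with_clean_header(rows, column_names, csv_filepath, delimiter=","):
    """Write rows to a CSV file, using the supplied column names as the
    header instead of the tablename.columnname header the hook produces."""
    with open(csv_filepath, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(column_names)
        writer.writerows(rows)
```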

0 votes

Most queries that throw this error have a corresponding function on the HiveMetastoreHook object that returns the correct result in structured form. Most of these have to do with table and partition metadata.
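For example (a rough mapping; the method names below are assumptions about HiveMetastoreHook, so verify them against your Airflow version):

```python
# Metadata HQL statements that break the header parsing, mapped to
# HiveMetastoreHook methods believed to return structured results instead.
metadata_equivalents = {
    "DESCRIBE my_table": "HiveMetastoreHook.get_table(table_name=..., db=...)",
    "SHOW PARTITIONS my_table": "HiveMetastoreHook.get_partitions(schema=..., table_name=...)",
    "SHOW TABLES": "HiveMetastoreHook.get_tables(db=...)",
}
```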