0 votes

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all the dt sub-partitions from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2. What I did is select * from my_table where dt=1000.

Unfortunately, what I get back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, while the original table does not have a single row with dt=NULL.

The DBCPConnectionPool is configured to use the HiveJDBC4 jar. I later tried the jar that matches our CDH release, but that didn't fix it either.

The ExecuteSQL processor is configured as such:

Normalize Table/Column Names: true

Use Avro Logical Types: false

Hive version: 1.1.0

CDH: 5.7.1

Any ideas what's happening? Thanks!

EDIT: Apparently the returned data also includes extra rows, a few thousand of them, which is quite weird.

Quite sure that's a problem with the JDBC driver. Could you check what result set is returned by other JDBC-based tools? For example, you could take SQuirreL SQL and try to execute the same query. - daggett
You might also want to try SelectHiveQL instead of ExecuteSQL; the former already includes the Hive 1.2.1 driver, and the latter doesn't "officially" support Hive, even with a third-party driver. - mattyb
@daggett Good point, I didn't try changing the driver. I'm currently using HiveJDBC4, but I'll try a different version. - Antil Karev
@mattyb Actually I tried using the Hive NAR at first, but I had problems with it because of the authentication method in my Hadoop cluster and Hive, so I'd rather use the ExecuteSQL processor so it stays generic too. - Antil Karev
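
Following daggett's suggestion above, here is a minimal sketch of how the same query could be run through plain JDBC outside NiFi to compare the result set; the driver class, host, port, and credentials are placeholder assumptions, not details from the question:

    // Minimal verification sketch: run the same query through plain JDBC and count
    // the rows the driver itself reports with dt = NULL.
    // Driver class, host, port, and credentials below are placeholders (assumptions).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VerifyQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // or the Simba driver class
            String url = "jdbc:hive2://hiveserver2-host:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select * from my_table where dt=1000")) {

                long total = 0;
                long nullDt = 0;
                while (rs.next()) {
                    total++;
                    // lookup by name; adjust if your driver prefixes names with the table
                    if (rs.getObject("dt") == null) {
                        nullDt++;
                    }
                }
                System.out.println("rows=" + total + ", dt IS NULL rows=" + nullDt);
            }
        }
    }

If the counts from this sketch match the table, the corruption is more likely introduced on the NiFi/Avro side; if they match the bad flow files, it points at the driver.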

2 Answers

0 votes

Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where the Apache Hive JDBC driver was incompatible: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, the column names retrieved from the ResultSetMetaData had the table names prepended, separated by a period (.). This is some of the custom code that lives in HiveJdbcCommon (used by SelectHiveQL) versus JdbcCommon (used by ExecuteSQL).
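
As an illustration of the column-name issue (a sketch of the idea, not the actual NiFi source), something along these lines shows how the prefixed names surface through ResultSetMetaData and how the prefix can be stripped; the ColumnNames helper is hypothetical:

    // Illustrative sketch only: how a driver that prepends the table name to column
    // names surfaces through ResultSetMetaData, and how the "table." prefix can be
    // stripped, roughly what HiveJdbcCommon does for SelectHiveQL.
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;

    public final class ColumnNames {
        static String[] fieldNames(ResultSet rs) throws SQLException {
            ResultSetMetaData meta = rs.getMetaData();
            String[] names = new String[meta.getColumnCount()];
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                // With the Apache Hive driver this can come back as "my_table.dt",
                // while getTableName(i) is unsupported or returns an empty string.
                String label = meta.getColumnLabel(i);
                int dot = label.indexOf('.');
                names[i - 1] = (dot >= 0) ? label.substring(dot + 1) : label;
            }
            return names;
        }
    }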

If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.

0 votes

Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
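
For reference, one way that property could be applied on the JDBC connection the processor uses (a sketch, assuming the Apache Hive HiveServer2 driver and placeholder connection details, and not necessarily how it was set in this case):

    // Sketch under assumptions: the property can be passed per connection in the
    // hive_conf_list part of the HiveServer2 JDBC URL (after the '?'), which can also
    // be used as the Database Connection URL in the DBCPConnectionPool; a per-session
    // SET statement is an alternative. Host, port, and credentials are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SequenceFileResultFormat {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://hiveserver2-host:10000/default"
                    + "?hive.query.result.fileformat=SequenceFile"; // per-connection override

            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement()) {
                // Equivalent per-session alternative:
                stmt.execute("set hive.query.result.fileformat=SequenceFile");
                // ...then run the original query, e.g. select * from my_table where dt=1000
            }
        }
    }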