1
votes

Since the upgrade from hdp 3.1.0 to 3.1.4, I have some issue in Hive I do not understand. Note that I am only using ORC transactional tables.

For instance this query:

with cte as (
  select
      e.id
    , '{}' as json
  from event e
)
-- select count(*) from cte
select 
    id
  , lv.customfield
from cte
lateral view outer
  json_tuple(cte.json, 'customfield') cv AS `customfield`

It worked perfectly before the upgrade.

Now, even if the CTE returns a certain number of rows, using the lateral view will just drop rows from the resultset, without any error, whereas there is no extra where clause outside the CTE (in my real example, the query returns 66 rows without the lateral view, but only 19 with).

In my case I have:

  • select count(*) give me 66 rows
  • when the lateral view on a static string is added, I only get 19 rows.

I tried quite a few variations:

  • if I replace the event table by a static CTE (select stack(1, ...)) I have the result I expect
  • if I remove the lateral view, I have the number of rows I expect (as long as I do not use is distinct from)
  • if instead of a CTE I create and use a temporary table, the outcome does not change.
  • if I put json_tuple(cte.json, 'customfield') in the select part outside the CTE (and nothing else as it would not be valid), without the lateral view, I have the number of expected rows,
  • If I use get_json_object in the select part outside the CTE (and no lateral view) I have the expected results.
  • of course, there is nothing in the hive (server or metastore) logs.
  • as a side note, since the upgrade a merge statement [keeps generating duplicates][1], whereas it worked perfectly before.

Another extremely surprising thing is that inside the CTE there is an if statement, for instance: if(is_deleted is null, 'true', 'false'). If I replace the is null with is not distinct from null, which should be perfectly valid, no rows are returned by the CTE.

I am completely at loss and I have no idea why this happens and how I can trust hive. 

I cannot replicate the error by generating manual data so I cannot give a (not) working example.

1

1 Answers

0
votes

The actual reason I do not understand yet, but I could isolate the problem and could actually submit a bug report: https://issues.apache.org/jira/browse/HIVE-22500

In short, a lesser than or equals with implicit string conversion to timestamp fails if a sort by (implicit or explicit) is involved.

-- valid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t;
-- invalid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;

You can see the bug report for full set up or other examples. The workaround is to explicitly cast the string to a timestamp.