I have a dataset that I'm de-duping with the following code:
select session_id, sol_id, id, session_context_code, date
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
I want to add a variable that stores the total count of rows after dedup, and I tried with count(*):
select session_id, sol_id, id, session_context_code, date,count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
The error I received:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'
I just want to output a count as a variable that counts all distinct records by session_id and sol_id after de-duped by row number. How do I incorporate that to the code?
Based on Gomz's suggestion, but received error:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:614 missing EOF at 'group' near 'nifi_date'
Code:
select session_id, solicit_id, nifi_date,id, session_context_code,count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1 and
session_context_code in ("4","3") and
order by session_id, sol_id, nifi_date
group by session_id, sol_id, nifi_date,id, session_context_code