I have a dataset that has counts by state and county and I would like to calculate the median and average by state and county such as:
Have:
ID state county count
1 MD aa 2
2 MD aa 4
3 VA bb 1
4 VA bb 2
5 VA bb 4
6 VA cc 7
7 VA cc 8
Want:
What I have so far that gives me error:
Select id, STATE,COUNTY,count,
percentile(cast(count as BIGINT), 0.5) OVER() as overall_median,
round(avg(count),2) OVER() as overall_avg,
percentile(cast(count as bigint),0.5) OVER(PARTITION BY id,STATE) as med_state,
percentile(cast(count as bigint),0.5) as med_county,
AVG(count) OVER (PARTITION BY id, STATE) as avg_state,
AVG(count) AS avg_county,
from have
group by id, state, county
Error received when not using group:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:457 Expression not in GROUP BY key 'id'
Code without group:
Select id, STATE,county,count,
percentile(cast(count as BIGINT), 0.5) OVER() as overall_median,
round(avg(count),2) OVER() as overall_avg,
percentile(cast(count as bigint),0.5) OVER(PARTITION BY id,STATE) as med_state,
percentile(cast(count as bigint),0.5) OVER(PARTITION BY id,STATE,county) as med_county,
AVG(count) OVER (PARTITION BY id, STATE) as avg_state,
AVG(count) OVER (PARTITION BY id, STATE, county) as avg_county,
from have
Thank you!
group by
i guess. – Vamsi Prabhala