I'm trying to produce a fully refreshed set of numbers each week, pulling from a table in hive. Right now I using this method:
SELECT
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-01-30") then userid end) as week_1,
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-02-06") then userid end) as week_2
FROM Data;
I'm trying to get something more along the lines of:
SELECT Month(timestamp), Week(timestamp), COUNT (DISTINCT userid) FROM Data Group By Month, Week
But my week runs Sunday to Saturday. Is there a smarter way to be doing this that works in HIVE?
Solution found:
You can simply create your own formula instead of going with pre-defined function for "week of the year" Advantage: you will be able to take any set of 7 days for a week.
In your case since you want the week should start from Sunday-Saturday we will just need the first date of sunday in a year
eg- In 2016, First Sunday is on '2016-01-03' which is 3rd of Jan'16 --assumption considering the timestamp column in the format 'yyyy-mm-dd'
SELECT
count(distinct UserId), lower(datediff(timestamp,'2016-01-03') / 7) + 1 as week_of_the_year
FROM table.data
where timestamp>='2016-01-03'
group by lower(datediff(timestamp,'2016-01-03') / 7) + 1;