0
votes

I'm trying to produce a fully refreshed set of numbers each week, pulling from a table in hive. Right now I using this method:

SELECT
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-01-30") then userid end) as week_1,
COUNT(DISTINCT case when timestamp between TO_DATE("2016-01-28") and TO_DATE("2016-02-06") then userid end) as week_2
FROM Data;

I'm trying to get something more along the lines of:

SELECT Month(timestamp), Week(timestamp), COUNT (DISTINCT userid) FROM Data Group By Month, Week

But my week runs Sunday to Saturday. Is there a smarter way to be doing this that works in HIVE?

Solution found:

You can simply create your own formula instead of going with pre-defined function for "week of the year" Advantage: you will be able to take any set of 7 days for a week.

In your case since you want the week should start from Sunday-Saturday we will just need the first date of sunday in a year

eg- In 2016, First Sunday is on '2016-01-03' which is 3rd of Jan'16 --assumption considering the timestamp column in the format 'yyyy-mm-dd'

SELECT
count(distinct UserId), lower(datediff(timestamp,'2016-01-03') / 7) + 1 as week_of_the_year 
FROM table.data 
where timestamp>='2016-01-03'
group by lower(datediff(timestamp,'2016-01-03') / 7) + 1; 
1

1 Answers

1
votes

I see that you need the data to be grouped by week. you can just do this :

SELECT weekofyear(to_date(timestamp)), COUNT (DISTINCT userid) FROM Data Group By weekofyear(to_date(timestamp))