I used Spark Streaming to handle an online requirement such as counting new users per hour, like this:
In each batch, when a log arrives, I look the uid up in an external table such as HBase or DynamoDB; if it does not exist, I insert it.
This approach hits the external table so frequently that it is too expensive.
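To make the cost concrete, here is a minimal sketch of that per-batch pattern, with a plain dict standing in for the external table (the log format and the helper name `process_batch` are my own assumptions, not from my real job):

```python
# Simulate the per-batch dedup against an external table.
# `external_table` stands in for HBase/DynamoDB; every log line
# costs one external read, and each unseen uid costs an extra write.

def process_batch(logs, external_table, stats):
    """logs: list of (pageid, uid) tuples seen in this micro-batch."""
    for pageid, uid in logs:
        stats["requests"] += 1          # one external read per log line
        if uid not in external_table:
            external_table[uid] = True
            stats["requests"] += 1      # plus one write for a new user
            stats["new_users"] += 1

external_table = {}
stats = {"requests": 0, "new_users": 0}
process_batch([("p1", "u1"), ("p1", "u1"), ("p2", "u2")], external_table, stats)
print(stats["requests"], stats["new_users"])  # 5 2
```

The request count grows with the raw log volume, not with the number of distinct users, which is why it gets expensive.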
Now I want to use Structured Streaming to solve this problem instead.
Offline, the following SQL solves the problem:
sql1

```sql
create table event_min_table as
select pageid, uid, floor(min(time)/3600)*3600 as event_time
from event_table
group by pageid, uid
```

sql2

```sql
select pageid, event_time, count(distinct uid) as cnt
from event_min_table
group by pageid, event_time
```
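For reference, the two aggregations above compute the following (a pure-Python sketch, under my assumption that `time` is a Unix timestamp in seconds; the sample events are made up):

```python
from collections import defaultdict

# Step 1 (sql1): per (pageid, uid), take the earliest event time,
# rounded down to the hour.
events = [
    ("p1", "u1", 7200), ("p1", "u1", 7300),   # same user, same page
    ("p1", "u2", 7260), ("p2", "u1", 10800),
]
min_time = {}
for pageid, uid, t in events:
    key = (pageid, uid)
    min_time[key] = min(min_time.get(key, t), t)
event_min_table = {k: (v // 3600) * 3600 for k, v in min_time.items()}

# Step 2 (sql2): count distinct uids per (pageid, event_time).
buckets = defaultdict(set)
for (pageid, uid), event_time in event_min_table.items():
    buckets[(pageid, event_time)].add(uid)
result = {k: len(v) for k, v in buckets.items()}
print(result)  # {('p1', 7200): 2, ('p2', 10800): 1}
```

This is what makes it a "new user" count: each uid contributes only to the hour bucket of its first event on a page.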
I am not familiar with Structured Streaming, and Structured Streaming does not support chaining multiple aggregations in one query, so I planned it like this: use `readStream` to build a query for sql1 and register its result as an in-memory table with output mode `complete`; then run sql2 as a query over that table with output mode `update`, saving the result to the external table (HBase or DynamoDB).
I don't know whether this approach can solve the problem, and I have several questions:
1. If I create an in-memory table with the `complete` output mode, will its data keep growing over time?
2. Even if this works, is the result emitted every time a log arrives? If so, the problem is still not solved, because my goal is to reduce the number of requests to the external table (HBase or DynamoDB).