1
votes

I want to perform an identical calculation based on daily values prior to a different time slot and in a specific time slot. My dataset consists out of hourly data of users over the year 2016, an example set of a single user is provided below.

For each day in a year I want to aggregate the positive values between before 03:00 and subtract those from the aggregate values between 03:00 and 05:00 (regardless whether these values are positive of negative)

Example data of a single user below

Time_index          value
2016-01-01 00:00:00 0.076805274
2016-01-01 01:00:00 -0.230500848            
2016-01-01 02:00:00 -0.013729424            
2016-01-01 03:00:00 -0.574499489            
2016-01-01 04:00:00 -0.987048205            
2016-01-01 05:00:00 0.172100756         
2016-01-01 06:00:00 0.495357096
"                       "
2016-01-02 00:00:00 0.714050398         
2016-01-02 01:00:00 0.205577513         
2016-01-02 02:00:00 -0.223645255            
2016-01-02 03:00:00 -0.562646915            
2016-01-02 04:00:00 -0.420385944            
2016-01-02 05:00:00 -0.548992675
2016-01-02 06:00:00 -0.732962989        


time_index <- seq(from = as.POSIXct("2016-01-01 00:00"),
              to  = as.POSIXct("2016-12-31 23:00"),
              by = "hour")

value <- runif(length(time_index), min = -1, max = 1)

example <- data.frame(time_index, value)

Based on the example data the calculation should look like this for January 1st:

Positive 00:00- 02:00          Everything 03:00 - 05:00

0.076805274         -        (-0.574499489 - 0.987048205 + 0.172100756)

How do I perform this calculation that depends on (1) two time frames that can differ in time and (2) the numbers being positive for the first instance, across all days in a year? And if this is doable, how could it be scaled up to do it for multiple users (so different columns of values).

I have tried using nested for-loops and do the calculations based on if-else statements, but I have not been able to loop over the each individual day in combination with the right timeframes. Apart from that, since nested for-loops are notoriously slow in R, I am afraid that it will take too long to do it for all users.

1

1 Answers

0
votes

Data and setup:

library(tidyverse)
library(lubridate)

set.seed(4) # for reproducibility
time_index <- seq(
  from = as.POSIXct("2016-01-01 00:00"),
  to  = as.POSIXct("2016-12-31 23:00"),
  by = "hour"
)    
value <- runif(length(time_index), min = -1, max = 1)
example <- data.frame(time_index, value)

First create an "intermediate" results dataframe -- taking note that we do not really care about hours that are not between 0 and 5.

df_intermediate <- example %>%

  mutate(
    date = as_date(time_index),
    hour = hour(time_index),
    hour_block = case_when(
      between(hour, 0, 2) ~ "block_1",
      between(hour, 3, 5) ~ "block_2",
      TRUE ~ NA_character_
    )
  ) %>% 

  filter(!is.na(hour_block)) %>% 
  group_by(date, hour_block) %>%
  nest() %>% 
  ungroup() %>%
  mutate(
    intermediate_result = if_else(
      hour_block == "block_1",
      map_dbl(data, ~ sum(.$value[.$value > 0 ])),
      map_dbl(data, ~ sum(.$value))
    )
  )

df_intermediate
#> # A tibble: 732 x 4
#>    date       hour_block data             intermediate_result
#>    <date>     <chr>      <list>                         <dbl>
#>  1 2016-01-01 block_1    <tibble [3 x 3]>              0.172 
#>  2 2016-01-01 block_2    <tibble [3 x 3]>             -0.297 
#>  3 2016-01-02 block_1    <tibble [3 x 3]>              0.960 
#>  4 2016-01-02 block_2    <tibble [3 x 3]>              0.770 
#>  5 2016-01-03 block_1    <tibble [3 x 3]>              0.576 
#>  6 2016-01-03 block_2    <tibble [3 x 3]>              1.90  
#>  7 2016-01-04 block_1    <tibble [3 x 3]>              0.650 
#>  8 2016-01-04 block_2    <tibble [3 x 3]>             -0.0962
#>  9 2016-01-05 block_1    <tibble [3 x 3]>              0.531 
#> 10 2016-01-05 block_2    <tibble [3 x 3]>              0.169 
#> # ... with 722 more rows

NOTE: You may want to look at df_intermediate %>% unnest(data) to see what's going on here.

df_intermediate %>% 
  group_by(date) %>%
  summarise(
    final_result = first(intermediate_result) - last(intermediate_result)
  )
#> # A tibble: 366 x 2
#>    date       final_result
#>    <date>            <dbl>
#>  1 2016-01-01       0.469 
#>  2 2016-01-02       0.189 
#>  3 2016-01-03      -1.32  
#>  4 2016-01-04       0.746 
#>  5 2016-01-05       0.362 
#>  6 2016-01-06       1.55  
#>  7 2016-01-07      -1.22  
#>  8 2016-01-08       0.873 
#>  9 2016-01-09      -0.0262
#> 10 2016-01-10      -0.953 
#> # ... with 356 more rows

Created on 2020-05-19 by the reprex package (v0.3.0)