1
votes

I get a data table from a server that shows price predictions depending on the selected month of a calendar year. Basically, data is downloaded from every month of the year. Here is an example data table:

set.seed(123)
dt.data <- data.table(Date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 365),
                      'BRN Jan-2021' = rnorm(365, 2, 1), 'BRN Jan-2022' = rnorm(365, 2, 1),
                      'BRN Feb-2021' = rnorm(365, 2, 1), 'BRN Feb-2022' = rnorm(365, 2, 1),
                      'BRN Mar-2021' = rnorm(365, 2, 1), 'BRN Mar-2022' = rnorm(365, 2, 1),
                      'BRN Apr-2021' = rnorm(365, 2, 1), 'BRN Apr-2022' = rnorm(365, 2, 1),
                      'BRN May-2021' = rnorm(365, 2, 1), 'BRN May-2022' = rnorm(365, 2, 1),
                      'BRN Jun-2021' = rnorm(365, 2, 1), 'BRN Jun-2022' = rnorm(365, 2, 1),
                      'BRN Jul-2021' = rnorm(365, 2, 1), 'BRN Jul-2022' = rnorm(365, 2, 1),
                      'BRN Aug-2021' = rnorm(365, 2, 1), 'BRN Aug-2022' = rnorm(365, 2, 1),
                      'BRN Sep-2021' = rnorm(365, 2, 1), 'BRN Sep-2022' = rnorm(365, 2, 1),
                      'BRN Oct-2021' = rnorm(365, 2, 1), 'BRN Oct-2022' = rnorm(365, 2, 1),
                      'BRN Nov-2021' = rnorm(365, 2, 1), 'BRN Nov-2022' = rnorm(365, 2, 1),
                      'BRN Dec-2021' = rnorm(365, 2, 1), 'BRN Dec-2022' = rnorm(365, 2, 1),
                      check.names = FALSE)

This data table is quite small as I only created data for the years 2021 and 2022. But there can be several calendar years, or just one calendar year.

Now I would like to calculate daily mean values (based on the date column) for the year 2021 (i.e. the sum of all 12 values per day / date divided by 12 = number of months per calendar year) and save them in a new data table as a column. And now of course the same for 2022.

In this case, the new data table should have the following columns:

| Date | BRN Cal-2021 | BRN Cal-2022 |

where the date column remains unchanged.

The calculation and the column designation for the new data table should always be variable (depending on how many calendar years appear in dt.data). Basically, it might make sense to organize dt.data by calendar year at the beginning. But actually I don't really know how to keep the average calculation (daily) variable and general? Or maybe you should create an extra data table for each calendar year, then calculate the mean values and then merge the columns with the daily mean values back into a common data table? However, this should always remain automated (depending on how many calendar years there are). Unfortunately I have no idea how that could be done.

I hope I was able to ask my question accurately enough and someone can help me with my problem.

1

1 Answers

2
votes

Yes, it would be better to get data in separate columns for each year. We can use pivot_longer for that and create new column based on the pattern in the column names. Once we get that we can just take mean for each Date.

library(dplyr)

dt.data %>%
  tidyr::pivot_longer(cols = -Date, 
               names_to = c('month', '.value'), 
               names_pattern = c('(.*)-(\\d+)')) %>%
  group_by(Date) %>%
  summarise(across(c(matches('^\\d+$')), mean, na.rm  =TRUE))

A base R option without getting the data in long format would be to use split.default. We split the data based on year mentioned in the column names and take rowwise mean in each list.

result <- cbind(dt.data[, 1], sapply(split.default(dt.data[, -1], 
      sub('.*-', '', names(dt.data)[-1])), rowMeans, na.rm = TRUE))
names(result)[-1] <- paste0('BRN_Cal-', names(result)[-1])

#           Date BRN_Cal-2021 BRN_Cal-2022
#  1: 2020-01-01     1.974847     2.272833
#  2: 2020-01-02     2.241470     2.399902
#  3: 2020-01-03     1.988883     2.372697
#  4: 2020-01-04     2.057867     2.084504
#  5: 2020-01-05     2.012305     2.049808
# ---                                     
#361: 2020-12-26     2.038167     2.161655
#362: 2020-12-27     2.308974     2.215492
#363: 2020-12-28     2.001359     2.552923
#364: 2020-12-29     2.086283     1.773254
#365: 2020-12-30     1.802871     2.107373