0
votes

I have 4 data frames, each data frame corresponding to a single year. Each data frame contains daily rainfall for five locations.

Generate sample data

    location <- c("A","B","C","D","E")
    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits=2)
    dat.1981 <-as.data.frame(cbind(location,mat)) # rainfall for 1981
    dat.1981$year <- 1981

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1982 <- as.data.frame(cbind(location,mat)) # rainfall for 1982
    dat.1982$year <- 1982

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1983 <-as.data.frame(cbind(location,mat)) # rainfall for 1983
    dat.1983$year <- 1983

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1984 <-as.data.frame(cbind(location,mat)) # rainfall for 1984
    dat.1984$year <- 1984

    dat <- as.data.frame(rbind(dat.1981,dat.1982,dat.1983,dat.1984))

For each year, I want to classify whether a day was an extreme wet day or not

Here's how I do my calculation:

1) For each location, generate the mean and sd of rainfall for every week for the period 1981 to 1984. For example, in location A the mean rainfall for the first week will be:

(First week rain 1981 in A + First week rain 1982 in A + First week rain 1983 in A + First week rain 1984 in A)/4

which can be written in R as

    mean.week1.loc1 <- mean(rowSums(dat[dat$location=="A",2:8])) # 2:8 selects the first 7 days in each year
    sd.week1.loc1 <- sd(rowSums(dat[dat$location=="A",2:8])) 

    wet.cr <- mean.week1 + sd.week1 # this is my threshold for defining a wet day

If daily rainfall in week 1 for the years 1981 to 1984 in location A is greater than wet.cr, that day is a wet day and hence gets a value of 1

As an example, to examine whether rainfall of week 1 for location A for 1981 to 1984 is a wet day or not I can do the following:

   lapply(dat[, 2:8], function(x) ifelse(x > wet.cr, 1, 0))

I want to repeat this for each week and each location.

However, I am unable to stitch these individual functions together and also my final results should be a dataframe same as dat but instead of rainfall values, I will have 1 or 0 defining whether it is a wet day or not.

EDIT

The solutions below gives me the following:

mean(c(rainfall 1981 day 1 week 1, ...., rainfall 1981 day 7 week 1, rainfall 1982 day 1 week 1,....,rainfall 1982 day 7 week 1,....,rainfall 1984 day 1 week 1,....,rainfall 1984 day 7 week 1))

WHAT I WANT IS:

mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))

I hope this is clear now.

3
This seems harder than it needs to be. If you have everything by date, use ISOweek to grab the year and week, then just aggregate over year in tidyr. If you give an example of the data with full dates instead of broken down in columns by week, I can show you....sconfluentus
I doubt that this kind of data is stored as shown, since not every year has 365 days. If you had dates, you could transform them to long, add a column which denotes week of year, group by that and get the mean and sd which you can compare the value to... Does the data exist as I expect or is it really as shown?Tino
Notably, 1984 has 366 days. Which one is missing from your data frame?Gregor Thomas
I have deleted the extra days in leap years so that all years have the same 365 days89_Simple
@Tino this is how I have the data as. I can try to manipulate it in the way that you suggested.89_Simple

3 Answers

2
votes

A tidyverse solution

    library(magrittr)
    library(tidyverse)

    dat_m <- gather(dat, day, rainfall, -location, -year)
    str(dat_m)

    dat_m %<>%
      mutate(day = gsub("V", "", day)) %>%
      mutate(day = as.numeric(day)) %>% 
      mutate(week = as.integer(ceiling(day/7))) %>% 
      group_by(location, week) %>% 
      mutate(wet.cr = mean(rainfall, na.rm = TRUE) + sd(rainfall, na.rm = TRUE) ) %>% 
      mutate(indication = ifelse(rainfall > wet.cr, 1, 0)) %>% 
      ungroup()
    dat_m 

    # A tibble: 7,300 x 7
       location  year   day rainfall  week wet.cr indication
       <fctr>   <dbl> <dbl>    <dbl> <int>  <dbl>      <dbl>
     1 A         1981  1.00    0.880     1  0.845       1.00
     2 B         1981  1.00    0.850     1  0.829       1.00
     3 C         1981  1.00    1.00      1  0.877       1.00
     4 D         1981  1.00    0.100     1  0.755       0   
     5 E         1981  1.00    0.190     1  0.750       0   
     6 A         1982  1.00    0.380     1  0.845       0   
     7 B         1982  1.00    0.760     1  0.829       0   
     8 C         1982  1.00    0.940     1  0.877       1.00
     9 D         1982  1.00    0.900     1  0.755       1.00
    10 E         1982  1.00    0.600     1  0.750       0   
    # ... with 7,290 more rows

Edit: For rainfall, I think it's better to use sum (total) than mean

So we first calculate the total weekly rainfall for every year. Then we calculate the long-term average & stdev of the total weekly rainfall.

    dat_m %<>%
      mutate(day = as.numeric(gsub("V", "", day)),
             week = as.integer(ceiling(day/7))) %>%
      group_by(location, week, year) %>% 
      mutate(total_weekly_rainfall = sum(rainfall, na.rm = TRUE)) %>% 
      ungroup() %>% 
      group_by(location, week) %>% 
      mutate(mean_weekly_rainfall = sum(rainfall, na.rm = TRUE)/length(unique(year)),
             stddev_weekly_rainfall = sd(rainfall, na.rm = TRUE),
             wet.cr =  mean_weekly_rainfall + stddev_weekly_rainfall,
             indication = ifelse(total_weekly_rainfall > wet.cr, 1, 0)) %>% 
      arrange(location, year, day) %>% 
      ungroup() %>% 
      distinct(location, year, week, .keep_all = TRUE)
    dat_m 

    # A tibble: 1,060 x 10
       location  year   day rainfall  week total_wee~ mean_wee~ stddev_w~ wet.~ indic~
       <fctr>   <dbl> <dbl>    <dbl> <int>      <dbl>     <dbl>     <dbl> <dbl>  <dbl>
     1 A         1981  1.00   0.880      1     0.880      0.630     0.277 0.907      0
     2 A         1981  8.00   0.190      2     0.190      0.330     0.431 0.761      0
     3 A         1981 15.0    0.630      3     0.630      0.548     0.331 0.878      0
     4 A         1981 22.0    0.0300     4     0.0300     0.290     0.259 0.549      0
     5 A         1981 29.0    0.360      5     0.360      0.308     0.196 0.504      0
     6 A         1981 36.0    0.540      6     0.540      0.500     0.225 0.725      0
     7 A         1981 43.0    0.0300     7     0.0300     0.375     0.289 0.664      0
     8 A         1981 50.0    0.170      8     0.170      0.332     0.375 0.708      0
     9 A         1981 57.0    0.260      9     0.260      0.652     0.272 0.924      0
    10 A         1981 64.0    0.590     10     0.590      0.512     0.202 0.715      0
    # ... with 1,050 more rows
1
votes

using data.table :

library(data.table)
dat <- setDT(dat)
newdat <- melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
newdat[,Week := day %/% 7]
newdat[,threshold := mean(rain) + sd(rain),  by = .(location,Week)]
newdat[,wet := ifelse(rain > threshold,1,0)]
print(newdat,topn = 100)


      location year day rain Week threshold wet
   1:        A 1981   1 0.73    0 0.7630065   0
   2:        B 1981   1 0.69    0 0.8599243   0
   3:        C 1981   1 0.45    0 0.8145956   0
   4:        D 1981   1 0.51    0 0.8935058   0
   5:        E 1981   1 0.77    0 0.6992752   1
   6:        A 1982   1 0.47    0 0.7630065   0
   7:        B 1982   1 0.70    0 0.8599243   0
   8:        C 1982   1 0.48    0 0.8145956   0
   9:        D 1982   1 0.92    0 0.8935058   1

a step by step explanation: first you need to change format of your data to ease the calculus. The long format is more appropriate, as each column V## is actually a variable that is the number day. This is done using melt

melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")

     location year  day rain
   1:        A 1981   V1 0.73
   2:        B 1981   V1 0.69
   3:        C 1981   V1 0.45
   4:        D 1981   V1 0.51
   5:        E 1981   V1 0.77
  ---                        
7296:        A 1984 V365 0.31
7297:        B 1984 V365 0.99
7298:        C 1984 V365 0.25
7299:        D 1984 V365 0.24
7300:        E 1984 V365 0.87

Then you tranform your day to a real number, in order to be able to calculate the week

newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
> newdat[,.(day,year)]
      day year
   1:   1 1981
   2:   1 1981
   3:   1 1981
   4:   1 1981
   5:   1 1981

Then calculate the week number the same as you do

newdat[,Week := day %/% 7]

The statistic for the calculus of the sthreshold is done by grouping on the weeks and places (so statistic over the year for each place)

newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]

and define your wet day as the day when rain is higher than the threshold

newdat[,wet := ifelse(rain > threshold,1,0)]

but I agree on the comment that the initial data was surely in a better format than what you are presenting.

0
votes

For both the data.table and the tidyverse solutions you may be well served to treat this as a scaling exercise (a z score in many disciplines), since the mean + n standard deviation is a well known benchmark.

For the data.table solution you would:

newdat[,zrain := scale(rain),  by = .(location,Week)]
newdat[,zwet := ifelse(zrain > 1.0,1,0)]

where you rely on scale from base and compare to 1.0

For the tidyverse that becomes:

mutate(zrain = scale(rainfall)) %>% 
mutate(indication = ifelse(zrain > 1.0, 1, 0)) %>% 

That way in the future if your standard for what "wet" is changes you only have to change one number in your code