Merge between two files in terms of date and postcode with calculating the day-weighted average for the air pollution data in R

Question

I have two data frames. One called Pollution. The second called Med. An example of both data frames as follows:

Pollution

year    month    postcode    PM10    NO2
2001      1       12345       40      20 
2001      2       12345       30      25
2001      3       12345       35      25
2001      4       12345       30      20
2001      1       62346       40      20 
2001      2       62346       30      25
2001      3       62346       35      25
2001      4       62346       30      20
2002      1       12345       44      24 
2002      2       12345       36      26
2002      3       12345       30      20
2002      4       12345       32      22
2002      1       62346       48      28 
2002      2       62346       20      35
2002      3       62346       89      101
2002      4       62346       37      27


Med

ID      postcode     trementDate_start      treatmentDate_end
1       12345         2001-01-15            2001-03-16
2       62346         2001-01-15            2001-02-16
3       12345         2002-02-21            2002-03-16
4       12345         2002-02-15            2002-04-16
4       62346         2002-03-16            2002-04-30

The idea is to link the (pollution) data with the (Med) data by the date and postcode.

To do that, I need to calculate an average pollution level based upon the number of days of exposure at a particular level (PM10, NO2).

First creating a column called num_of_day to calculate the day length in each month for the periods between the start and end date of treatment in the medical data frame. The subtracting idea between the end date to start date has been found not precise.

For Example ( I will take ID number 1 with a postcode of 12345 just for an explanation of how I calculated the pollution average for PM10 and NO2 by putting into consideration the days)

Med  
ID      postcode     trementDate_start      treatmentDate_end
1       12345         2001-01-15            2001-03-16


Pollution
year    month    postcode    PM10    NO2
2001      1       12345       40      20 
2001      2       12345       30      25
2001      3       12345       35      25

The air pollution of PM10 and NO2 values for the periods between 2001-01-15 and 2001-03-16 will be as follow:

The trementDate_start (2001-01-15) its PM10 = 40 and NO2 = 20.
the periods in between (2001-02-00) its PM10 = 30 and NO2 = 25.
The trementDate_end (2001-03-16) its PM10 = 35 and NO2 = 25.

I have to then calculate the day of exposure for those periods each:

the trementDate_start (2001-01-15) [January have total of 31 days] 15/31 = 0.48 days of exposure
the periods in between (2001-02-00) [February have a total of 29 days] this should remain the same PM10 and NO2 values because the pollution measurements in the file are on monthly basis. So, it will be: 29/29 = 1 days of exposure
the trementDate_end (2001-03-16) [March have total of 31 days] 16/31 = 0.51 days of exposure

Then I can calculate afterwards the pollution average based on the exposure days:

the trementDate_start (2001-01-15) exposure days 0.48 * 40 = 19.2(for the PM10) and 0.48 * 20 = 9.6 (for the NO2)
the periods in between (2001-02-00): 1 * 30 = 30 for PM10 and 1* 25 = 25 for NO2
the trementDate_end (2001-03-16): 0.51 * 35 = 17.85 for PM10 and 0.51 * 25 = 12.75 for NO2

Then add the PM10 together (19.2 + 30 + 17.85 = 67.05). Then I will divide 67.05 by 3 months ( 3 month is the period were the person get exposed to the air pollution during his first treatment), which is equal to 22.35

The output should be like below:

ID  postcode  trementDate_start    treatmentDate_end.    PM10.   NO2
1    12345     2001-01-15             2001-03-16         22.35  15.78

zoowalk, created the code below based on my previous requirement before I updated the information with the precise day thing. It worked perfectly.

I saw this post. stackoverflow.com/questions/15569333/…. I think this can short the idea of calculating the precise days that I explained above, which takes into account the fact that not all months and years have the same number of days, e.g., the leap year. Still cant figure out how to put them in a code with the other points looking up for the postcode and year and month.

I would appreciate extra help with this. I see it as too complex for me.

zoowalk zoowalk · Accepted Answer · 2020-12-01T19:12:33

Does this help? I am not sure I fully understand for what kind of average you are looking for. Why is 70 / 32 days = 40.93?

library(tidyverse)


pollution <- data.frame(
          year = c(2001L,2001L,2001L,2001L,2001L,
                   2001L,2001L,2001L,2002L,2002L,2002L,2002L,2002L,2002L,
                   2002L,2002L),
         month = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,
                   3L,4L,1L,2L,3L,4L),
      postcode = c(12345L,12345L,12345L,12345L,62346L,
                   62346L,62346L,62346L,12345L,12345L,12345L,12345L,
                   62346L,62346L,62346L,62346L),
          PM10 = c(40L,30L,35L,30L,40L,30L,35L,30L,
                   44L,36L,30L,32L,48L,20L,89L,37L),
           NO2 = c(20L,25L,25L,20L,20L,25L,25L,20L,
                   24L,26L,20L,22L,28L,35L,101L,27L)
  ) %>% 
  mutate(date_floor=paste(year,month, 01, sep="-") %>% 
           lubridate::ymd())
pollution
#>    year month postcode PM10 NO2 date_floor
#> 1  2001     1    12345   40  20 2001-01-01
#> 2  2001     2    12345   30  25 2001-02-01
#> 3  2001     3    12345   35  25 2001-03-01
#> 4  2001     4    12345   30  20 2001-04-01
#> 5  2001     1    62346   40  20 2001-01-01
#> 6  2001     2    62346   30  25 2001-02-01
#> 7  2001     3    62346   35  25 2001-03-01
#> 8  2001     4    62346   30  20 2001-04-01
#> 9  2002     1    12345   44  24 2002-01-01
#> 10 2002     2    12345   36  26 2002-02-01
#> 11 2002     3    12345   30  20 2002-03-01
#> 12 2002     4    12345   32  22 2002-04-01
#> 13 2002     1    62346   48  28 2002-01-01
#> 14 2002     2    62346   20  35 2002-02-01
#> 15 2002     3    62346   89 101 2002-03-01
#> 16 2002     4    62346   37  27 2002-04-01


med <- data.frame(
   stringsAsFactors = FALSE,
                 ID = c(1L, 2L, 3L, 4L, 4L),
           postcode = c(12345L, 62346L, 12345L, 12345L, 62346L),
   treatmentDate_start = c("15/01/2001",
                          "15/01/2001","21/02/2002","15/03/2002","16/04/2002"),
    treatmentDate_end = c("16/02/2001",
                          "16/02/2001","16/03/2002","16/04/2002","30/04/2002")
  )  

med <- med %>% 
  mutate(across(.cols=contains("Date"), lubridate::dmy)) %>% #convert to class date
  pivot_longer(cols=contains("treatment"),
               names_to = "date_type",
               values_to = "date") %>% 
  mutate(date_floor=lubridate::floor_date(date,
                                           unit="month"))


df_join <- med %>% 
  left_join(., pollution) %>% 
  select(-date_floor) 
#> Joining, by = c("postcode", "date_floor")
df_join
#> # A tibble: 10 x 8
#>       ID postcode date_type           date        year month  PM10   NO2
#>    <int>    <int> <chr>               <date>     <int> <int> <int> <int>
#>  1     1    12345 treatmentDate_start 2001-01-15  2001     1    40    20
#>  2     1    12345 treatmentDate_end   2001-02-16  2001     2    30    25
#>  3     2    62346 treatmentDate_start 2001-01-15  2001     1    40    20
#>  4     2    62346 treatmentDate_end   2001-02-16  2001     2    30    25
#>  5     3    12345 treatmentDate_start 2002-02-21  2002     2    36    26
#>  6     3    12345 treatmentDate_end   2002-03-16  2002     3    30    20
#>  7     4    12345 treatmentDate_start 2002-03-15  2002     3    30    20
#>  8     4    12345 treatmentDate_end   2002-04-16  2002     4    32    22
#>  9     4    62346 treatmentDate_start 2002-04-16  2002     4    37    27
#> 10     4    62346 treatmentDate_end   2002-04-30  2002     4    37    27

df_join <- df_join %>% 
  pivot_wider(id_cols=c(ID, postcode, year, month, PM10, NO2),
              names_from = date_type,
              values_from = date) %>%
  mutate(treatmentDate_start = case_when(is.na(treatmentDate_start) ~ lubridate::floor_date(treatmentDate_end, unit="month"),
                                       TRUE ~ as.Date(treatmentDate_start ))) %>% 
  mutate(treatmentDate_end = case_when(is.na(treatmentDate_end) ~ lubridate::ceiling_date(treatmentDate_start, unit="month"),
                                       TRUE ~ as.Date(treatmentDate_end ))) %>% 
  mutate(duration=treatmentDate_end-treatmentDate_start) 

#this is basically all the info you need.
glimpse(df_join)
#> Rows: 9
#> Columns: 9
#> $ ID                  <int> 1, 1, 2, 2, 3, 3, 4, 4, 4
#> $ postcode            <int> 12345, 12345, 62346, 62346, 12345, 12345, 12345...
#> $ year                <int> 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002,...
#> $ month               <int> 1, 2, 1, 2, 2, 3, 3, 4, 4
#> $ PM10                <int> 40, 30, 40, 30, 36, 30, 30, 32, 37
#> $ NO2                 <int> 20, 25, 20, 25, 26, 20, 20, 22, 27
#> $ treatmentDate_start <date> 2001-01-15, 2001-02-01, 2001-01-15, 2001-02-01...
#> $ treatmentDate_end   <date> 2001-02-01, 2001-02-16, 2001-02-01, 2001-02-16...
#> $ duration            <drtn> 17 days, 15 days, 17 days, 15 days, 8 days, 15...


df_join %>% 
  group_by(ID, postcode) %>% 
  summarise(across(.cols=c(PM10, NO2, duration), .fns=sum)) %>% 
  mutate(across(.cols=c(PM10, NO2), .fns=function(x) x/as.numeric(duration)))
#> `summarise()` regrouping output by 'ID' (override with `.groups` argument)
#> # A tibble: 5 x 5
#> # Groups:   ID [4]
#>      ID postcode  PM10   NO2 duration
#>   <int>    <int> <dbl> <dbl> <drtn>  
#> 1     1    12345  2.19  1.41 32 days 
#> 2     2    62346  2.19  1.41 32 days 
#> 3     3    12345  2.87  2    23 days 
#> 4     4    12345  1.94  1.31 32 days 
#> 5     4    62346  2.64  1.93 14 days

^{Created on 2020-12-01 by the reprex package (v0.3.0)}

Merge between two files in terms of date and postcode with calculating the day-weighted average for the air pollution data in R

2 Answers