I have two data frames. One called Pollution. The second called Med. An example of both data frames as follows:
Pollution
year month postcode PM10 NO2
2001 1 12345 40 20
2001 2 12345 30 25
2001 3 12345 35 25
2001 4 12345 30 20
2001 1 62346 40 20
2001 2 62346 30 25
2001 3 62346 35 25
2001 4 62346 30 20
2002 1 12345 44 24
2002 2 12345 36 26
2002 3 12345 30 20
2002 4 12345 32 22
2002 1 62346 48 28
2002 2 62346 20 35
2002 3 62346 89 101
2002 4 62346 37 27
Med
ID postcode trementDate_start treatmentDate_end
1 12345 2001-01-15 2001-03-16
2 62346 2001-01-15 2001-02-16
3 12345 2002-02-21 2002-03-16
4 12345 2002-02-15 2002-04-16
4 62346 2002-03-16 2002-04-30
The idea is to link the (pollution) data with the (Med) data by the date and postcode.
To do that, I need to calculate an average pollution level based upon the number of days of exposure at a particular level (PM10, NO2).
First creating a column called num_of_day to calculate the day length in each month for the periods between the start and end date of treatment in the medical data frame. The subtracting idea between the end date to start date has been found not precise.
For Example ( I will take ID number 1 with a postcode of 12345 just for an explanation of how I calculated the pollution average for PM10 and NO2 by putting into consideration the days)
Med
ID postcode trementDate_start treatmentDate_end
1 12345 2001-01-15 2001-03-16
Pollution
year month postcode PM10 NO2
2001 1 12345 40 20
2001 2 12345 30 25
2001 3 12345 35 25
The air pollution of PM10 and NO2 values for the periods between 2001-01-15 and 2001-03-16 will be as follow:
- The trementDate_start (2001-01-15) its PM10 = 40 and NO2 = 20.
- the periods in between (2001-02-00) its PM10 = 30 and NO2 = 25.
- The trementDate_end (2001-03-16) its PM10 = 35 and NO2 = 25.
I have to then calculate the day of exposure for those periods each:
- the trementDate_start (2001-01-15) [January have total of 31 days] 15/31 = 0.48 days of exposure
- the periods in between (2001-02-00) [February have a total of 29 days] this should remain the same PM10 and NO2 values because the pollution measurements in the file are on monthly basis. So, it will be: 29/29 = 1 days of exposure
- the trementDate_end (2001-03-16) [March have total of 31 days] 16/31 = 0.51 days of exposure
Then I can calculate afterwards the pollution average based on the exposure days:
- the trementDate_start (2001-01-15) exposure days 0.48 * 40 = 19.2(for the PM10) and 0.48 * 20 = 9.6 (for the NO2)
- the periods in between (2001-02-00): 1 * 30 = 30 for PM10 and 1* 25 = 25 for NO2
- the trementDate_end (2001-03-16): 0.51 * 35 = 17.85 for PM10 and 0.51 * 25 = 12.75 for NO2
Then add the PM10 together (19.2 + 30 + 17.85 = 67.05). Then I will divide 67.05 by 3 months ( 3 month is the period were the person get exposed to the air pollution during his first treatment), which is equal to 22.35
The output should be like below:
ID postcode trementDate_start treatmentDate_end. PM10. NO2
1 12345 2001-01-15 2001-03-16 22.35 15.78
zoowalk, created the code below based on my previous requirement before I updated the information with the precise day thing. It worked perfectly.
I saw this post. stackoverflow.com/questions/15569333/…. I think this can short the idea of calculating the precise days that I explained above, which takes into account the fact that not all months and years have the same number of days, e.g., the leap year. Still cant figure out how to put them in a code with the other points looking up for the postcode and year and month.
I would appreciate extra help with this. I see it as too complex for me.