
I have a data frame called 'madrid' that holds measurements of air pollutants along with the date, year, month, and day. The data run from 2001 to 2018.

'data.frame':    3808224 obs. of  15 variables:
 $ date   : Date, format: "2001-08-01" "2001-08-01" "2001-08-01" ...
 $ BEN    : num  NA 1.5 NA NA NA ...
 $ CO     : num  0.37 0.34 0.28 0.47 0.39 ...
 $ EBE    : num  NA 1.49 NA NA NA ...
 $ NMHC   : num  NA 0.07 NA NA NA ...
 $ NO_2   : num  58.4 56.2 50.7 69.8 22.8 ...
 $ O_3    : num  34.5 42.2 46.3 40.7 66.3 ...
 $ PM10   : num  105 100.6 100.1 69.8 75.2 ...
 $ SO_2   : num  6.34 8.11 7.85 6.46 8.8 ...
 $ TCH    : num  NA 1.24 NA NA NA ...
 $ TOL    : num  NA 10.8 NA NA NA ...
 $ station: int  28079001 28079035 28079003 28079004 28079039 28079006 28079007 28079009 28079038 28079011 ...
 $ year   : num  2001 2001 2001 2001 2001 ...
 $ month  : num  8 8 8 8 8 8 8 8 8 8 ...
 $ day    : int  1 1 1 1 1 1 1 1 1 1 ...

I'm now trying to create a plot that shows the monthly average of the pollutants, summed into a single total. Ideally it would be one line/curve with the months on the x-axis, so I can see whether there's seasonality. So far, I've done this with dplyr:

madrid_season <- madrid %>%
  group_by(month, year) %>%
  summarise(BEN_mean = mean(BEN, na.rm = TRUE),
            CO_mean = mean(CO, na.rm = TRUE),
            EBE_mean = mean(EBE, na.rm = TRUE),
            NMHC_mean = mean(NMHC, na.rm = TRUE),
            NO_2_mean = mean(NO_2, na.rm = TRUE),
            O_3_mean = mean(O_3, na.rm = TRUE),
            PM10_mean = mean(PM10, na.rm = TRUE),
            SO_2_mean = mean(SO_2, na.rm = TRUE),
            TCH_mean = mean(TCH, na.rm = TRUE),
            TOL_mean = mean(TOL, na.rm = TRUE)) 

and then I sum the averages (excluding the month and year column):

# columns 1-2 are month and year; columns 3-12 are the ten pollutant means
madrid_season$tot <- rowSums(madrid_season[, 3:12], na.rm = TRUE)

and then I try and plot:

p4 <- ggplot(madrid_season, aes(x=month,y=tot))+geom_point()+geom_line()

Except this plots every year's totals at the same twelve month positions on the x-axis and joins them all with one line, instead of showing each month of each year in sequence, so I end up with a wacky-looking plot.

I think I need help using dplyr in this case (or I'm open to other suggestions!).

How would you like to show the years and months on the plot? An x-axis with all the years and months in chronological order? Superimposed plots for each year? One panel for each year? - user2332849

@user2332849 Yeah, ideally I'd like the years and months in chronological order if possible. - dejsdukes

Are you sure you want to sum all those air quality measurements? - Edward

Put a Date class variable on the x-axis, like the first day of each month. Add in a column that's as.Date(paste(year, month, "01", sep = "-")) and put that on the x-axis. - Gregor Thomas

@dejsdukes: Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use str(), head() or screenshot)? You can use the reprex and datapasta packages to assist you with that. See also Help me Help you & How to make a great R reproducible example? - Tung

1 Answer


I wouldn't advise summing the monthly means of the different pollutants. The measurement with the largest scale (here it looks like PM10) would swamp all the others.
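To put a rough number on that, here is a small sketch using only the first row of values visible in the str() output in the question (the non-NA columns). These are single observations rather than monthly means, so treat the result as an illustration of scale, not as a summary of the data.

```r
# First row from the str() output in the question (non-NA columns only)
first_row <- c(CO = 0.37, NO_2 = 58.4, O_3 = 34.5, PM10 = 105, SO_2 = 6.34)

# Fraction of the summed total contributed by each pollutant
share <- first_row / sum(first_row)
round(share, 3)
```

In this row PM10 alone makes up more than half of the sum, while CO contributes well under 1%, so any seasonality in CO would be invisible in the total.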

You can either plot each one individually to see the seasonality, or reshape into long format and use the measurement as a facet in the ggplot.

I'll use the mydata dataset from the openair package since you haven't provided any data. The variables are similar to yours.

library(openair)

str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   65533 obs. of  10 variables:
 $ date: POSIXct, format: "1998-01-01 00:00:00" "1998-01-01 01:00:00" ...
 $ ws  : num  0.6 2.16 2.76 2.16 2.4 3 3 3 3.36 3.96 ...
 $ wd  : int  280 230 190 170 180 190 140 170 170 170 ...
 $ nox : int  285 NA NA 493 468 264 171 195 137 113 ...
 $ no2 : int  39 NA NA 52 78 42 38 51 42 39 ...
 $ o3  : int  1 NA 3 3 2 0 0 0 1 2 ...
 $ pm10: int  29 37 34 35 34 16 11 12 12 12 ...
 $ so2 : num  4.72 NA 6.83 7.66 8.07 ...
 $ co  : num  3.37 NA 9.6 10.22 8.91 ...
 $ pm25: int  NA NA NA NA NA NA NA NA NA NA ...

library(dplyr)
library(lubridate)
library(ggplot2)

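# Monthly mean of every column, computed separately for each year
mydata %>%
  mutate(Year = factor(year(date)), Month = month(date, label = TRUE)) %>%
  group_by(Year, Month) %>%
  summarise_all(list(mean = mean), na.rm = TRUE) %>%  # columns become o3_mean, no2_mean, ...
  # one line per year, months on the x-axis
  ggplot(aes(y = o3_mean, x = Month, group = Year, col = Year)) +
  geom_line(lwd = 1) +
  ylab(expression(O[3]))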

[Plot: monthly mean O3 by month, one coloured line per year]

So seasonality of O3 is evident.
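A side note on the summarising step: in dplyr 1.0 and later, summarise_all() is superseded by across(). The sketch below shows what I believe is an equivalent call. It runs on a small made-up data frame (demo, with hypothetical values) rather than mydata so that it is self-contained, and it uses where(is.numeric) to skip the date column instead of averaging it.

```r
library(dplyr)
library(lubridate)

# Hypothetical hourly data standing in for openair::mydata (values are made up)
set.seed(42)
n_hours <- 24 * 90  # 1 Jan 1998 to 31 Mar 1998
demo <- data.frame(
  date = seq(as.POSIXct("1998-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = n_hours),
  o3   = rpois(n_hours, lambda = 5),
  no2  = rpois(n_hours, lambda = 40)
)

monthly <- demo %>%
  mutate(Year = factor(year(date)), Month = month(date, label = TRUE)) %>%
  group_by(Year, Month) %>%
  # mean of every numeric column, named <column>_mean like summarise_all() would
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"),
            .groups = "drop")

monthly
```

The result has one row per Year/Month combination with columns o3_mean and no2_mean, so it can be piped into the same ggplot() call as above.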

If you want to plot all measurements, then reshape to long form, and facet on the measurement.

library(tidyr)

mydata %>%
  mutate(Year = factor(year(date)), Month = month(date, label = TRUE)) %>%
  group_by(Year, Month) %>%
  summarise_all(list(mean = mean), na.rm = TRUE) %>%
  select(-date_mean) %>%  # drop the averaged timestamp
  # split e.g. "o3_mean" into measurement = "o3" and a value column named "mean"
  pivot_longer(cols = ends_with("_mean"),
               names_to = c("measurement", ".value"),
               names_sep = "_") %>%
  ggplot(aes(y = mean, x = Month, group = Year, col = Year)) +
  geom_line() +
  facet_wrap(~measurement, scales = "free") +  # each measurement gets its own axis scales
  scale_x_discrete(labels = function(x) substr(x, 1, 1))  # J, F, M, ... to save space

[Plot: monthly means by month, one panel per measurement, one coloured line per year]
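If you would rather have the years and months in chronological order on a single axis, as you mentioned in the comments, Gregor Thomas's suggestion of a first-of-the-month Date column works well. Below is a minimal sketch. madrid_demo is a simulated stand-in for your madrid data frame (only the column layout mirrors yours; the values and the three pollutants chosen are made up). It also reshapes to long form before summarising, because names like NO_2 contain an underscore and would not split cleanly with names_sep = "_".

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for 'madrid' (hypothetical values, three pollutants only)
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day")
doy   <- as.numeric(format(dates, "%j"))  # day of year, drives the fake seasonality
n     <- length(dates)

madrid_demo <- data.frame(
  date  = dates,
  NO_2  = 50 + 15 * cos(2 * pi * doy / 365) + rnorm(n, sd = 5),
  O_3   = 40 - 20 * cos(2 * pi * doy / 365) + rnorm(n, sd = 5),
  PM10  = 30 + rnorm(n, sd = 8),
  year  = as.numeric(format(dates, "%Y")),
  month = as.numeric(format(dates, "%m"))
)

madrid_long <- madrid_demo %>%
  # long form first, so underscores in column names never need to be parsed
  pivot_longer(cols = c(NO_2, O_3, PM10),
               names_to = "pollutant", values_to = "value") %>%
  group_by(pollutant, year, month) %>%
  summarise(mean = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # first day of each month as a Date, so the x-axis is chronological
  mutate(month_start = as.Date(paste(year, month, "01", sep = "-")))

p <- ggplot(madrid_long, aes(x = month_start, y = mean)) +
  geom_line() +
  facet_wrap(~pollutant, scales = "free_y", ncol = 1) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = NULL, y = "Monthly mean")
p
```

With your real data you would replace madrid_demo with madrid and list all ten pollutant columns in cols. Each panel then shows one continuous line from 2001 to 2018, and a seasonal pattern shows up as a cycle that repeats every year.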