0
votes

Introduction

Summary:

  • Trying to average data by season (when necessary) when certain conditions are met.

Hello everyone.

I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.

For the most part, the data reported per parameter is one instance per season.

Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.

However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.

To complicate things even further, some of the data is marked by a Greater Than or Less Than (GTLT) symbol). In these cases, values and dates are not averaged unless the recorded value is the same.


Data Example

Summary:

  • Code and Tables show requested changes in data-set

So, for a data-driven example...

Here's a few rows from a data set.

Data.Example<-structure(list(
    Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
    End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
    GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value = c(10, 10, 20, 30, 15, 10)), 
    .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

This is ideally what I would like as output

Data.Example.New<-structure(list(
    Station.ID.new = c(13402, 13402, 13402, 13402), 
    End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"), 
    GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value.new = c(10, 20, 22.5, 10)), 
    .Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")

Here, the following things are occurring:

  • For Enterococci measured in July and Aug 13, there is a GTLT symbol, but Value for both == 10. So average dates. New row is 7/28/2016 and Value 10.
  • While Enterococci on Aug 15 is within same season as other values, since GTLT value is different, it would only be averaged in same season of same year with other values of 20. In this case, since it is only one where Value==20, that row does not change and is repeated in final data frame.
  • Alkalinity in July and August are same season, so average dates (7/28/16) and Value (22.5) in new row.
  • Alkalinity in October is different season, so keep row.
  • All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.

If for some reason you have a GTLT and non-GTLT for same parameter:

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/13/2016            10         Alk
8/15/2016            20         Alk

Then final result would be

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/14/2016            15         Alk

Approach

Summary:

  • Define seasons and then aggregate using package like dplyr?
  • Create loop function to read row by row (after sort by Parameter.Name then Date?)

As one might expect, this is where I'm stuck.

I know seasons can be defined in R from prior Stack Q's:

New vector of seasons based on dates

And I know that average/aggregation packages such as dplyr (and possibly zoo?) can do chaining commands.

My issue is putting this thought process into code that can be repeated for each data set.

I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.

I quickly sketched my thoughts on what some sort of loop function would have to include:

Rough idea of flow diagram

Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need averaged as well. Hence finding row [i+n] that breaks loop before last step, averaging all prior rows [i+n-1], and moving on to next new row [i+n].

Further, as clarification, the season would have to be within season of that annual cycle. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016! == 4/13/2015 in regards to averaging.


Conclusion and Summary

In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.

I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.

Thank you for your time in advance.

Cheers,

soccernamlak

1
I think you got a downvote because the question is very long. You could probably have asked this question in a shorter, simpler way. That said, you were clear and provided sample data, so I think you're okay. I wanted to make sure I understand a certain point you make. You say 12/12/2015 == 01/01/2013, but 4/13/2016 != 4/13/2015. Are you sure you didn't mean 12/12/2015 == 01/01/*2015*? - jmartindill
I do apologize for the length. In regards to the date, I've edited the primary post, but I actually meant 12/12/2015 == 01/01/2016 In short, December 2015 and January 2016 are considered the same season ecologically. - soccernamlak

1 Answers

0
votes

Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except I did not keep the exact dates, which I felt were misleading in the final result.

Data.Example<-structure(list(
  Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
  End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
  Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
  GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
  Value = c(10, 10, 20, 30, 15, 10)), 
  .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

# Create season key
seasons <- data.frame(month = 1:12, season = c(rep("DJF",2), rep("MAM", 3), rep("JJA", 3), rep("SON",3), "DJF"))

# Isolate Month and Year, create Season column
Data.Example$Month <- lubridate::month(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Year <- lubridate::year(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Season <- seasons$season[Data.Example$Month]

# Update 'year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12]+1

# Find out which station/year/season/paramaters have at least one record with a GTLT
GTLT.Test<- Data.Example %>% 
  group_by(Station.ID, Year, Season, Parameter.Name) %>%
  summarize(has_GTLT = max(nchar(as.character(GTLT))))

# First only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>% 
  anti_join(GTLT.Test[GTLT_test$has_GTLT == 1,], 
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
  summarize(Value.new = mean(Value))

# Now do the same for groups with GTLT, only combining when values and GTLT symbols match.
Data.Example.New2 <- Data.Example %>% 
  anti_join(GTLT.Test[GTLT_test$has_GTLT == 0,], 
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
  summarize(Value.new = mean(Value)) %>%
  select(-Value)

# Combine both
Data.Example.New <- rbind(Data.Example.New1, Data.Example.New2)

EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not date, and does not use actual seasons. I did this because in your example, Dec. 12 matches with Jan. 1. December 12 is technically fall, so I assumed you weren't using actual seasons, but were instead using four three-month groupings.