1
votes

I have Date/Time information where I want to get the average, min, max, range of the dates across "seasons" grouped by years and the only way I have been slightly successful of doing this is with tapply. The closest is the summary function in tapply. Results are what I only assume is a list by year of the min, max, 1st Qu, median etc. These are not exactly what i need but works fine.

Example of results I get from the tapply function from 1 season:

$`2003`
                     Min.                   1st Qu.                    Median                      Mean                   3rd Qu.                      Max. 
"2003-04-22 00:00:00 UTC" "2003-05-03 00:00:00 UTC" "2003-05-12 00:00:00 UTC" "2003-05-10 02:00:00 UTC" "2003-05-18 00:00:00 UTC" "2003-05-21 00:00:00 UTC" 

$`2004`
                     Min.                   1st Qu.                    Median                      Mean                   3rd Qu.                      Max. 
"2004-04-07 00:00:00 UTC" "2004-04-13 00:00:00 UTC" "2004-05-10 00:00:00 UTC" "2004-05-01 07:08:56 UTC" "2004-05-11 12:00:00 UTC" "2004-05-20 00:00:00 UTC" 

What I want to do is combine these lists I generate for different seasons into one big happy dataframe that I can export into a csv. I have searched and searched and am getting even more confused. Most people suggest the simplest is

test = do.call(rbind, sSM, eSM, sC) #note here sSM, eSM, sC are my "seasons"

However this gives me an error message or an empty "test" frame.

Ive read about the plyr package and assume that ddply instead of the tapply to force the result into a dataframe should work, but I can't figure out how to get that to even work because I can't get it to do the functions I would need such as the mean, min, max, & range...

The end result I would like is something like:

Season   Year   Min           Max      Mean        Median  
sSM     2003   2003-04-21   2003-5-1  2003-4-25   2003-4-23
eSM     2003...
sSM     2004...
eSM     2004...

THEN I would like to do the exact same, only determine within the Individual across years. I have an ID field in which in some years the same individual has been measured. I would like to get a mean start date for that individual across years within each season. Leaving out those individuals who were not measured over multiple years.

As I do not understand how to manipulate lists and arrays and even what tapply is doing I am at a loss as to how to fix all of this. Do I create an empty dataframe and put all this stuff into it in some loop or something? Below is some sample data from 4 "seasons" across a couple of years notice some dates have the date and time stamps and some just have the date.

    structure(list(Year = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 
2003L, 2004L, 2004L, 2004L, 2004L, 2004L, 2004L, 2005L, 2005L, 
2005L, 2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L), ID = structure(c(11L, 
12L, 13L, 14L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 14L, 15L, 1L, 2L, 3L, 6L), .Label = c("c_002", "c_102", 
"c_104", "c_105", "c_109", "c_401", "c_9814", "c_9815", "c_9816", 
"c_9819", "c_9901", "c_9902", "c_9905", "c_9908", "c_9911", "c_9912", 
"c_9916"), class = "factor"), sSM = structure(c(10L, 10L, 9L, 
17L, 8L, 8L, 7L, 18L, NA, 1L, NA, 15L, 13L, 12L, 6L, 3L, 5L, 
2L, 4L, 16L, 14L, 11L, 11L), .Label = c("04/07/2004 15:00", "04/23/2005 10:01", 
"04/25/2005 03:01", "04/27/2005 02:00", "04/27/2005 08:00", "04/29/2005 04:00", 
"05/01/2003", "05/03/2003", "05/04/2003", "05/05/2003", "05/05/2006", 
"05/07/2005 16:01", "05/11/2004 11:00", "05/11/2006", "05/13/2004 08:00", 
"05/14/2006", "05/17/2003", "05/17/2004 12:02"), class = "factor"), 
    eSM = structure(c(13L, 18L, 15L, 21L, 16L, 10L, 14L, 20L, 
    NA, 1L, NA, 11L, 7L, 5L, 6L, 17L, 3L, 2L, 4L, 19L, 9L, 12L, 
    8L), .Label = c("04/27/2004 05:00", "05/01/2005 05:00", "05/06/2005 05:00", 
    "05/08/2005 07:01", "05/12/2005 21:00", "05/15/2005 19:00", 
    "05/18/2004 13:00", "05/18/2006", "05/20/2006", "05/21/2003", 
    "05/21/2004 01:01", "05/23/2006", "05/24/2003", "05/25/2003", 
    "05/26/2003", "05/27/2003", "05/27/2005 01:00", "05/28/2003", 
    "05/28/2006", "06/01/2004 02:01", "06/03/2003"), class = "factor"), 
    sC = structure(c(9L, 12L, 16L, 19L, 18L, 12L, 7L, 13L, NA, 
    10L, NA, 20L, 4L, 14L, 11L, 5L, 1L, 2L, 3L, 8L, 17L, 6L, 
    15L), .Label = c("05/26/2005 00:00", "05/26/2005 10:00", 
    "05/27/2005 06:01", "05/28/2004 08:00", "05/29/2005 23:01", 
    "05/29/2006", "05/30/2003", "05/30/2006", "05/31/2003", "05/31/2004 06:01", 
    "05/31/2005 15:00", "06/01/2003", "06/01/2004 07:02", "06/01/2005 16:00", 
    "06/01/2006", "06/03/2003", "06/03/2006", "06/04/2003", "06/05/2003", 
    "06/05/2004 01:00"), class = "factor"), eC = structure(c(11L, 
    2L, 15L, 6L, 17L, 17L, 2L, 8L, NA, 7L, NA, 13L, 5L, 9L, 9L, 
    3L, 10L, 1L, 4L, 12L, 14L, 16L, 12L), .Label = c("06/03/2005 18:00", 
    "06/04/2003", "06/04/2005 04:01", "06/05/2005 05:01", "06/06/2004 22:00", 
    "06/07/2003", "06/07/2004 23:00", "06/08/2004 19:01", "06/08/2005 03:00", 
    "06/10/2005 20:00", "06/12/2003", "06/13/2006", "06/14/2004 00:00", 
    "06/14/2006", "06/16/2003", "06/18/2006", "06/19/2003"), class = "factor")), .Names = c("Year", 
"ID", "sSM", "eSM", "sC", "eC"), class = "data.frame", row.names = c(NA, 
-23L))

Here is some code Ive written thus far:

dates$StartSM =  as.POSIXct(strptime(dates$sSM,"%m/%d/%Y",tz="UTC"),tz="UTC")                 
dates$EndSM =  as.POSIXct(strptime(dates$eSM,"%m/%d/%Y",tz="UTC"),tz="UTC")
dates$EndC =  as.POSIXct(strptime(dates$eC,"%m/%d/%Y %H:%M",tz="UTC"),tz="UTC")
dates$StartC =  as.POSIXct(strptime(dates$sC,"%m/%d/%Y %H:%M",tz="UTC"),tz="UTC")
sSpringM = tapply(dates$StartSM, dates$Year, summary)
eSpringM  = tapply(dates$EndSM, dates$Year, summary)
sCalving = as.vector(tapply(dates$StartC, dates$Year, summary))
eCalving = tapply(dates$EndC, dates$Year, summary)
datadates = do.call(rbind, sSpringM, eSpringM, sCalving)
1
Can you replace the human readable version of your data with the output of dput(youDataGoesHere)? It will be much easier to make sure that the solution developed works for you since data types et al will match up exactly.Chase
@Chase, sorry - I started to do that but then didn't....guess its always best to add it. I just thought the question was getting to long. I just want people to know Ive been trying to come up with something and making an attempt.Kerry
No worries - I find dput() especially helpful when date/time formats are involved...differences therein seem to be the source of a lot of confusion. I'm still confused as to how you are defining spring and summer? Do the columns sSM, eSM, eC, sC have any value after you convert them to proper time formats above?Chase
Can you elaborate on the 'Seasons' aspect of you data? As DWin notes below there doesn't appear to be any Seasons in the data you provide, and additionally I'd point out that all the dates in that sample are between April-June.joran
@Joran Sorry for the confusion over season - I should have been more clear. In my "discussion" I used season as an example and put it into terms that in general people would understand. I didn't think it would matter what they were labeled in the actual data. I will correct this. As for my sample data - I have 126,000 data points. Limiting to April & June is my way of cutting things down. I have 15 "seasons" - so I figured what ever help I could get, I would extend it to go through each of my seasons on my own.Kerry

1 Answers

0
votes

You have (at the top anyway) a list, call it "yourlist". Try this to print them:

lapply(yourlist, format, "%Y-%m-%d")

If you want them in a matrix:

do.call(rbind, sapply(yourlist, unlist))

I don't see any "winter", "spring", etc in your structure. I have in the past used the data.table package's year, quarter, month and week functions to return a vector of numeric values for years, quarters, months and weeks. It's easy to plot with the integer years plus appropriate fractional additions like year(.)+quarter/4. Easier than the corresponding cut and label hassles that would otherwise result.