tapply returns NA for every level of the factor index or insists the object and index are different lengths

Question

I'm trying to use tapply to get the average weight of turtles caught per day. tapply returns NA for every date value (class:POSIXct) for every approach I've tried

I've tried: calling tapply on the weight column and date column -> arguments are different lengths error

removing records with NA values in the weight column of my dataframe then calling tapply on the weight column and date column. -> arguments are different lengths error

calling tapply on the na.omit call of the weight column and the date column indexed by the na.omit call of the weight column -> arguments are different lengths error

calling tapply on the na.omit call of the weight column and the factor-coerced date column indexed by the na.omit call of the weight column -> returns NA for every level of the factor-coerced date column

head of original dataframe

> head(stinkpotData)
       Date     DateCt  Species Turtle.ID ID.Code             Location Recapture Weight.g C.Length.mm
1  6/1/2001 2001-06-01 Stinkpot         1       1   keck lab dock site         0      190          95
2  6/1/2001 2001-06-01 Stinkpot         2      10        Right of dock         0      200         100
3  8/9/2001 2001-08-09 Stinkpot         2      10 #4 Deep Right of lab         1      175         104
4 8/27/2001 2001-08-27 Stinkpot         2      10 #4 Deep Right of lab         1      175         105
5  6/1/2001 2001-06-01 Stinkpot         3      11        Right of dock         0      200         109
6 10/3/2001 2001-10-03 Stinkpot         3      11 #4 Deep Right of lab         1      205         109
  C.Width.mm Female.1.Male.2 Rotation                                  Marks
1         70            <NA>     <NA>                                   <NA>
2         72            <NA>     <NA>                                   <NA>
3         72               2     <NA>                                   Male
4         71               2     <NA>    male, 1 small leech Right front leg
5         74            <NA>     <NA>                          algae covered
6         76               2     <NA> male, 1 lg & 1 sm leech right rear leg

head of the original dataframe with records with NA weights omitted (checked that NAs were actually omitted)

> head(noNAWeightsDf)
       Date     DateCt  Species Turtle.ID ID.Code             Location Recapture Weight.g C.Length.mm
1  6/1/2001 2001-06-01 Stinkpot         1       1   keck lab dock site         0      190          95
2  6/1/2001 2001-06-01 Stinkpot         2      10        Right of dock         0      200         100
3  8/9/2001 2001-08-09 Stinkpot         2      10 #4 Deep Right of lab         1      175         104
4 8/27/2001 2001-08-27 Stinkpot         2      10 #4 Deep Right of lab         1      175         105
5  6/1/2001 2001-06-01 Stinkpot         3      11        Right of dock         0      200         109
6 10/3/2001 2001-10-03 Stinkpot         3      11 #4 Deep Right of lab         1      205         109
  C.Width.mm Female.1.Male.2 Rotation                                  Marks
1         70            <NA>     <NA>                                   <NA>
2         72            <NA>     <NA>                                   <NA>
3         72               2     <NA>                                   Male
4         71               2     <NA>    male, 1 small leech Right front leg
5         74            <NA>     <NA>                          algae covered
6         76               2     <NA> male, 1 lg & 1 sm leech right rear leg

calling tapply on the columns in the original dataframe

> tapply(stinkpotData$Weight.g, stinkpotData$DateCt, FUN = mean)
Error in tapply(stinkpotData$Weight.g, stinkpotData$DateCt, FUN = mean) : 
  arguments must have same length

calling tapply on the columns in the noNA dataframe

>tapply(noNAWeightsDf$Weight.g, noNAWeightsDf$DateCt, FUN = mean)
Error in tapply(noNAWeightsDf$Weight.g, noNAWeightsDf$DateCt, FUN = mean) : 
  arguments must have same length

calling tapply on the na.omit call of the weight column and the date column

> tapply(na.omit(stinkpotData$Weight.g), stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)], FUN = mean)
Error in tapply(na.omit(stinkpotData$Weight.g), stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)],  : 
  arguments must have same length

calling tapply on the na.omit call of the weight column and the factor-

coerced date column indexed by the na.omit call of the weight column 
tapply(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)]), FUN = mean)
2001-01-07 2001-06-01 2001-06-04 2001-06-06 2001-06-07 2001-06-11 2001-06-12 2001-06-15 2001-06-19 
        NA         NA         NA         NA         NA         NA         NA         NA         NA 
2001-06-20 2001-06-25 2001-06-27 2001-06-29 2001-07-03 2001-07-09 2001-07-11 2001-07-13 2001-07-16 
        NA         NA         NA         NA         NA         NA         NA         NA         NA ................etc

There were 50 or more warnings (use warnings() to see the first 50)

calling warnings() after the above error gives:

> warnings()
Warning messages:
1: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
.......................etc

EDIT:

split(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)])) Gave a list of the individual weights of turtles on each date. Verified that it was of mode list. Its elements were of mode numeric, class factor. lapply on the split list with FUN=mean still returned NA for each level of date. Can get means of individual elements of the split list of coerced to vectors but not quite what I need.

EDIT 2: Finally got the result I wanted, but the steps to get there seem over-complicated and I still don't understand why using tapply won't work. I had to call split as in the first edit, then coerce each element of the resultant list to class numeric (originally returned as class factor) with lapply, then call mean on every element with lapply:

weightsDateList = split(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)]))
weightsDateList = lapply(weightsDateList, FUN = as.numeric)
weightsDateList = lapply(weightsDateList, FUN = mean)

EDIT 3: I realize now that the result I get from the solution in EDIT 2 and calling tapply( severely underestimates the means, so still lost.

EDIT 4: Realized that converting weight to class numeric returned the number of the level of the weight from when it was a factor, which explains the severe underestimation of means.

I want the tapply call to return every date with turtle weight(s) and its respective average weight of turtles caught on those dates. Thanks and I apologize if I'm missing something easy.

Have you tried aggregate(Weight.g ~ DateCt, data = stinkpotData, mean) — heds1
unless you have reason to, I'd recommend against using tapply. data.table and dplyr both offer much easier grouping facilities. I'm quite partial to data.table but I recommend checking out both & seeing what suits you — MichaelChirico
I don't see a problem with tapply but I'm quite partial to base R. Many of its methods offer grouping facilities: tapply, by, split, ave, aggregate to name a few. I recommend checking these out & seeing what suits you. — Parfait
Please dput a few rows of your actual original dataframe that reproduces this error. Did you check NAs in DateCt? — Parfait
@heds1 aggregate(Weight.g ~ DateCt, data = stinkpotData, mean) gave an invalid type error as DateCt is a list it seems. I coerced it to a factor then called aggregate but it just returned NA for every level again — smbritton

Parfait Parfait · Accepted Answer · 2019-07-06T19:43:44

Generally, to use tapply you must heed the following rules regarding its arguments:

First argument must be or cast-able to a logical, integer, or numeric. Factors, characters, or other types cannot be used here.
Second argument must be or cast-able to a factor which can be any basic data type with exceptions to more complex types. This includes multiple groupings if using list() where tapply then returns a matrix.
- Because this argument only takes factors, it is redundant to cast with as.factor() which likely tapply does already under the hood.
Third argument must be a function that returns an atomic, numeric value for each input (i.e., first argument) sliced by group(s) (i.e., second argument).
Length: The first and second argument must be the same length which is given if both derive from a data frame as data frames by definition is a class object of list type that contains equal length atomic vectors.
- Because of this rule, avoid running different operations on first or second argument as it can result in different lengths. Instead, either run same operation on both vectors or even better run operation on the whole data frame before calling tapply:
- Since each NA maintains a length of one (unlike NULL), its presence does not matter in tapply. However, the child function can have issues with NA that tapply raises upstream.

Specifically, your issue regards the original types: factor type of Weight.g, and POSIXlt type of DateCt. Consider converting these types to adhere to tapply.

But do not directly cast these original types to factor as its underlying numerics or factor level number will result causing undesirable results. For numeric conversion, cast first to character. For POSIXlt cast to Date or character. Below demonstrates with OP's dput of first ten rows with other grouping methods.

Data (only two relevant columns)

stinkpotDataDeparsed <- structure(list(Weight.g = structure(c(15L, 13L, 20L, 16L, 15L, 
12L, NA, 12L, 15L, 20L, 26L), .Label = c("100", "105", "106", 
"107", "110", "115", "1150", "120", "125", "126", "128", "130", 
"135", "138", "140", "145", "150", "155", "159", "160", "165", 
"168", "170", "175", "180", "185", "187", "190", "195", "20", 
"200", "205", "210", "215", "220", "225", "230", "235", "245", 
"250", "40", "45", "50", "55", "60", "65", "70", "75", "80", 
"85", "90", "95", "oops!"), class = "factor"), DateCt = structure(list(
    sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(20L, 30L, 8L, 29L, 
    23L, 26L, 12L, 17L, 29L, 13L, 4L), mon = c(8L, 8L, 10L, 10L, 
    5L, 5L, 6L, 6L, 6L, 5L, 5L), year = c(101L, 101L, 101L, 101L, 
    102L, 102L, 102L, 102L, 102L, 103L, 101L), wday = c(4L, 0L, 
    4L, 4L, 0L, 3L, 5L, 3L, 1L, 5L, 1L), yday = c(262L, 272L, 
    311L, 332L, 173L, 176L, 192L, 197L, 209L, 163L, 154L), isdst = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EST", 
    "EST", "EST", "EST", "EST", "EST", "EST", "EST", "EST", "EST", 
    "EST"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", 
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt"), tzone = c("EST", 
"EST", "   "))), .Names = c("Weight.g", "DateCt"), row.names = 60:70, class = "data.frame")

Cleaning

# REMOVE NAs FROM DATA FRAME TO RUN ON ALL COLUMNS BUT DOES NOT MATTER W/ tapply
stinkpotDataDeparsed <- stinkpotDataDeparsed[!is.na(stinkpotDataDeparsed$Weight.g),]

# CAST FACTOR TYPE TO NUMERIC    
stinkpotDataDeparsed$Weight.g <- as.numeric(as.character(stinkpotDataDeparsed$Weight.g))

# CAST POISXlt TO DATE OR CHARACTER FOR FACTOR-ABILITY
stinkpotDataDeparsed$DateCt <- as.Date(stinkpotDataDeparsed$DateCt)
# stinkpotDataDeparsed$DateCt <- as.character(stinkpotDataDeparsed$DateCt)

Tapply (returns a vector)

with(stinkpotDataDeparsed, tapply(Weight.g, DateCt, mean))     

# 2001-06-04 2001-09-20 2001-09-30 2001-11-08 2001-11-29 2002-06-23 2002-06-26 2002-07-17 2002-07-29 2003-06-13 
#        185        140        135        160        145        140        130        130        140        160

Aggregate (returns a data frame)

aggregate(Weight.g ~ DateCt, data = stinkpotDataDeparsed, mean)

#        DateCt Weight.g
# 1  2001-06-04      185
# 2  2001-09-20      140
# 3  2001-09-30      135
# 4  2001-11-08      160
# 5  2001-11-29      145
# 6  2002-06-23      140
# 7  2002-06-26      130
# 8  2002-07-17      130
# 9  2002-07-29      140
# 10 2003-06-13      160

Ave (returns vector of same length as input, so can be assigned a data frame column)

stinkpotDataDeparsed$Wgt.Mean <- with(stinkpotDataDeparsed, ave(Weight.g, DateCt, FUN=mean))
stinkpotDataDeparsed

#    Weight.g     DateCt Wgt.Mean
# 60      140 2001-09-20      140
# 61      135 2001-09-30      135
# 62      160 2001-11-08      160
# 63      145 2001-11-29      145
# 64      140 2002-06-23      140
# 65      130 2002-06-26      130
# 67      130 2002-07-17      130
# 68      140 2002-07-29      140
# 69      160 2003-06-13      160
# 70      185 2001-06-04      185

By (object-oriented wrapper to tapply, returns a list)

by(stinkpotDataDeparsed, stinkpotDataDeparsed$DateCt, FUN=function(sub) mean(sub$Weight.g))

# stinkpotDataDeparsed$DateCt: 2001-06-04
# [1] 185
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2001-09-20
# [1] 140
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2001-09-30
# [1] 135
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2001-11-08
# [1] 160
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2001-11-29
# [1] 145
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2002-06-23
# [1] 140
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2002-06-26
# [1] 130
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2002-07-17
# [1] 130
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2002-07-29
# [1] 140
# ------------------------------------------------------------ 
# stinkpotDataDeparsed$DateCt: 2003-06-13
# [1] 160

Rextester Demo