R subset returns limited set with warning

Question

I have a data.frame with 12500 obs. of 8 variables, one of which is State (taxes$State). I want to subset the data down to multiple states that I get from user input in Shiny, but I kept getting dropped data when I added more than one state. I've got down to the subset function that is acting funky. I get no warning with just two states, but the third throws the exception. In every case, I'm limited to 250 obs. on the return. For example

temp <- subset(taxes, State == c("AL", "MO", "TX"))

Warning messages:
1: In is.na(e1) | is.na(e2) : longer object length is not a multiple of shorter object length
2: In ==.default(State, c("AL", "MO", "TX")) : longer object length is not a multiple of shorter object length

I've tried other variables also with the same result

temp<-subset(taxes,StateFullName==c("Iowa","Missouri","Texas"))

Warning message: In StateFullName == c("Iowa", "Missouri", "Texas") : longer object length is not a multiple of shorter object length

Any ideas as to why I'm limited to 250 obs?

akrun · Accepted Answer · 2015-08-15T04:56:37

You need %in% rather than == when the right-hand side is a vector of length > 1, i.e.

subset(taxes, State %in% c('AL', 'MO', 'TX'))
#   State amount
#4     MO  14143
#27    TX  11517
#30    AL  14465
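
If the selected states arrive as a character vector (for example from a multi-select input), the same membership test can be written with plain logical indexing. This is only a sketch: `taxes_demo` and `selected` are hypothetical stand-ins, not the asker's actual data or input.

```r
# Hypothetical stand-in for the asker's data: 50 states, 5 rows each.
taxes_demo <- data.frame(
  State  = rep(state.abb, each = 5),
  amount = seq_len(250),
  stringsAsFactors = FALSE
)

# Hypothetical user selection; its length can be 1, 2, 3 or more.
selected <- c("AL", "MO", "TX")

# Keep every row whose State is a member of `selected`.
picked <- taxes_demo[taxes_demo$State %in% selected, ]

nrow(picked)          # 15 rows: 5 per selected state
unique(picked$State)  # "AL" "MO" "TX"
```

Because %in% tests membership rather than position, the number of rows returned does not depend on how many states are selected or on the row order.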

Or, using data.table, we convert the 'data.frame' to a 'data.table' (setDT(taxes)), set the key column to 'State', and extract the rows that have 'MO', 'TX', or 'AL' in 'State'.

library(data.table)
setDT(taxes, key='State')[c('MO', 'TX', 'AL')]
#    State amount
#1:    MO  14143
#2:    TX  11517
#3:    AL  14465

To understand why your code didn't work, let's check the logical vector output.

with(taxes, State==c('AL', 'MO', 'TX'))
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [49] FALSE FALSE

Warning messages: 1: In is.na(e1) | is.na(e2) : longer object length is not a multiple of shorter object length

None of the elements were TRUE for this example. The comparison relies on recycling: the first 3 elements of 'State' are compared with 'AL', 'MO', and 'TX', in that order.

 taxes$State[1:3] == c('AL', 'MO', 'TX')
 #[1] FALSE FALSE FALSE

Here, the two vectors are compared element by element, and because

  taxes$State[1:3]
  #[1] AK AL AR

does not match 'AL', 'MO', and 'TX' at the corresponding positions, the result is all FALSE.

In the same way, the comparison continues along the full length of the 'State' column, i.e. the next comparison is

 taxes$State[4:6] == c('AL', 'MO', 'TX')
 #[1] FALSE FALSE FALSE

Here, too, all are FALSE because the corresponding 'State' elements were 'AZ', 'CA', and 'CO'. We get a warning at the end because

 nrow(taxes)
 #[1] 50

50%%3!=0

If the dataset had 51 rows, there would be no warning, but because the comparison is still based on position, we may not get the intended result.
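
To make the recycling behaviour concrete, here is a small sketch with a made-up vector `x` (not the asker's data) whose length is a multiple of 3. No warning is raised, yet == still compares by position, while %in% tests membership. The last lines show why the asker's 12500-row data warns with three states but not with two.

```r
# Hypothetical vector of length 6, a multiple of length(targets).
x <- c("AL", "MO", "TX", "TX", "AL", "MO")
targets <- c("AL", "MO", "TX")

# `==` recycles `targets` to AL MO TX AL MO TX and compares position by position.
by_position <- x == targets
by_position    # TRUE TRUE TRUE FALSE FALSE FALSE

# %in% asks, for each element of x, whether it occurs anywhere in `targets`.
by_membership <- x %in% targets
by_membership  # TRUE TRUE TRUE TRUE TRUE TRUE

# The warning depends only on whether the lengths divide evenly.
12500 %% 2  # 0: two states recycle cleanly, so no warning
12500 %% 3  # 2: three states do not, so R warns
```

In both cases the == result is silently wrong for filtering; the warning only appears when the lengths happen not to divide evenly.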

data

set.seed(24)
taxes <- data.frame(State=sample(state.abb), 
       amount=sample(400:20000, 50, replace=TRUE), stringsAsFactors=FALSE)
