1
votes

I have a data.frame with 12500 obs. of 8 variables, one of which is State (taxes$State). I want to subset the data down to multiple states that I get from user input in Shiny, but I kept losing rows when I added more than one state. I've narrowed it down to the subset function, which is acting funky. I get no warning with just two states, but a third triggers the warning below. In every case, I'm limited to 250 obs. on the return. For example:

temp <- subset(taxes, State == c("AL", "MO", "TX"))

Warning messages:
1: In is.na(e1) | is.na(e2) :
  longer object length is not a multiple of shorter object length
2: In ==.default(State, c("AL", "MO", "TX")) :
  longer object length is not a multiple of shorter object length

I've tried other variables too, with the same result:

temp<-subset(taxes,StateFullName==c("Iowa","Missouri","Texas"))

Warning message:
In StateFullName == c("Iowa", "Missouri", "Texas") :
  longer object length is not a multiple of shorter object length

Any ideas as to why I'm limited to 250 obs?


3 Answers

4
votes

You just need %in% to test each value against a vector of length > 1, i.e.

subset(taxes, State %in% c('AL', 'MO', 'TX'))
#   State amount
#4     MO  14143
#27    TX  11517
#30    AL  14465
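
Since the states come from user input, the same idea can be wrapped in a small helper. This is only a sketch: `filter_states`, `selected`, and `toy` are made-up names, not from the question, and `selected` stands in for whatever vector the Shiny input returns (which may be NULL when nothing is selected).

```r
# Hypothetical helper: keep rows whose State is in the user's selection.
filter_states <- function(df, selected) {
  # x %in% NULL and x %in% character(0) are all FALSE,
  # so an empty selection returns zero rows rather than an error.
  df[df$State %in% selected, , drop = FALSE]
}

# Toy data, made up for illustration
toy <- data.frame(State = c("AL", "AK", "MO", "TX", "AL"),
                  amount = 1:5, stringsAsFactors = FALSE)

nrow(filter_states(toy, c("AL", "MO", "TX")))  # 4
nrow(filter_states(toy, NULL))                 # 0
```

Whether an empty selection should return no rows or all rows is a design choice for the app; the sketch above returns none.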

Or, using data.table, we convert the 'data.frame' to a 'data.table' (setDT(taxes)), set the key column to 'State', and extract the rows that have 'MO', 'TX', or 'AL' in 'State'.

library(data.table)
setDT(taxes, key='State')[c('MO', 'TX', 'AL')]
#    State amount
#1:    MO  14143
#2:    TX  11517
#3:    AL  14465

To understand why your code didn't work, let's check the logical vector output.

with(taxes, State==c('AL', 'MO', 'TX'))
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [49] FALSE FALSE

Warning messages:
1: In is.na(e1) | is.na(e2) :
  longer object length is not a multiple of shorter object length

None of the elements is TRUE for this example. The comparison relies on recycling: the first 3 elements of 'State' are compared with 'AL', 'MO', and 'TX', in that order.

 taxes$State[1:3] == c('AL', 'MO', 'TX')
 #[1] FALSE FALSE FALSE

Here, the comparison is element-by-element between corresponding positions of the two vectors. Since

  taxes$State[1:3]
  #[1] AK AL AR

does not match 'AL', 'MO', and 'TX' at the corresponding positions, every comparison returns 'FALSE'.

The comparison continues in the same way along the whole 'State' column, three elements at a time, i.e. the next comparison is

 taxes$State[4:6] == c('AL', 'MO', 'TX')
 #[1] FALSE FALSE FALSE

Again, all are FALSE, as the corresponding 'State' elements were 'AZ', 'CA', and 'CO'. We get a warning at the end because

 nrow(taxes)
 #[1] 50

50%%3!=0

If the dataset had 51 rows, there would be no warning, but since the comparison is still based on position, we may not get the intended result.
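
A smaller case makes this last point concrete. In this sketch (the vector `st` is made up), the length is a multiple of 3, so there is no warning, yet `==` finds only the values that happen to sit at the matching position:

```r
st <- c("AL", "TX", "MO", "MO", "AL", "TX")   # length 6, a multiple of 3

# c("AL", "MO", "TX") is recycled to AL MO TX AL MO TX and compared by position
by_position <- st == c("AL", "MO", "TX")
by_position
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE

# %in% asks, for each element, whether it appears anywhere in the set
by_membership <- st %in% c("AL", "MO", "TX")
by_membership
# [1] TRUE TRUE TRUE TRUE TRUE TRUE
```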

data

set.seed(24)
taxes <- data.frame(State=sample(state.abb), 
       amount=sample(400:20000, 50, replace=TRUE), stringsAsFactors=FALSE)
1
votes

The logical expression in the function does not do what you want. It does not look for observations equal to any element of c("AL","MO","TX"); instead, the vector is recycled along State and compared position by position. As 12500 is a multiple of 2, a two-element vector recycles evenly, so there is no warning, although the result is still wrong. As 12500 is not a multiple of 3, a three-element vector does not recycle evenly; R still does the positional comparison, but issues the warning.
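
The 250-row result in the question can be reproduced under one assumption: that the 12500 rows are 250 per state, stored in contiguous blocks. The question does not confirm this layout, so `big` below is a made-up stand-in for `taxes`.

```r
# Hypothetical layout: 50 states x 250 rows each, one state after another
big <- data.frame(State = rep(state.abb, each = 250), stringsAsFactors = FALSE)
nrow(big)                                   # 12500

# Two states: recycles evenly, no warning, but only every other row of
# each state's block lines up with its own name
n_eq2 <- sum(big$State == c("AL", "MO"))    # 250
n_in2 <- sum(big$State %in% c("AL", "MO"))  # 500

# Three states: same positional comparison, plus the warning
# (suppressed here to keep the output short)
n_eq3 <- suppressWarnings(sum(big$State == c("AL", "MO", "TX")))  # 250
n_in3 <- sum(big$State %in% c("AL", "MO", "TX"))                  # 750
```

With a different row order the counts from `==` would differ, but they would still be positional accidents rather than the rows for the requested states.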

In short, an option to substitute the logical expression would be:

temp <- subset(taxes, State == "AL" | State == "MO" | State == "TX")

This can be tested in this simple example:

df <- data.frame(x = c("A", "B", "A", "C", "D", "E", "A", "C"))
subset(df, x=="A" | x =="B" | x == "C")
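
The chain of `|` has to be written out by hand for every value, which does not suit a selection that changes at run time. As a sketch, the same chain can be built from a vector with `Reduce`, and it gives the same rows as `%in%` (the names `wanted`, `keep_or`, and `keep_in` are made up here):

```r
df <- data.frame(x = c("A", "B", "A", "C", "D", "E", "A", "C"))
wanted <- c("A", "B", "C")

# Build x == "A" | x == "B" | x == "C" from the vector
keep_or <- Reduce(`|`, lapply(wanted, function(w) df$x == w))

# A membership test gives the same logical vector
keep_in <- df$x %in% wanted

identical(keep_or, keep_in)   # TRUE
sum(keep_or)                  # 6
```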
0
votes

Yep. So what I didn't know or understand, since I've never used a vector in subset, was that == compares against c(a,b,c) position by position, as a sequence, rather than as a list of values to match individually. Thanks all for the help!