
I have a large dataset (> 3k rows) that I want to filter based on geographic location and date. The location filtering works fine but I get the following error message when using logical operators on Dates with filter (dplyr):

Error: level sets of factors are different

My current code is below:

head(master.data)
   State.Name County.Code Latitude Longitude Arithmetic.Mean Date.Local
1     Alabama           3 30.49748 -87.88026             8.0 2014-01-02
2     Alabama           3 30.49748 -87.88026             7.0 2014-01-05
3     Alabama           3 30.49748 -87.88026             7.0 2014-01-08
4     Alabama           3 30.49748 -87.88026             3.6 2014-01-11
5     Alabama           3 30.49748 -87.88026             5.2 2014-01-14
6     Alabama           3 30.49748 -87.88026             4.4 2014-01-17  

master.data$Date.Local <- as.Date(master.data$Date.Local, format = "%Y-%m-%d")

site.info <- data.frame("Alabama", 3, 30, 90, "28/12/2015", "13/07/2016")
names(site.info) <- c("State.Name", "County.Code", "Latitude", "Longitude", 
                       "Date.Start", "Date.End")
site.info$Date.Start <- as.Date(site.info$Date.Start, format = "%d/%m/%Y")
site.info$Date.End <- as.Date(site.info$Date.End, format = "%d/%m/%Y")

reduced.data <- filter(master.data, State.Name == site.info$State.Name, 
                       Date.Local >= site.info$Date.Start 
                       & Date.Local <= site.info$Date.End)

Both site.info and master.data have the dates formatted using as.Date. The input format is different because they are imported from external sources.

Outside of filter, I can perform logical operations on the two date columns and get the expected results, so I am not sure why filter fails. Using %in% yields the same result:

Date.Local %in% c(site.info$Date.Start, site.info$Date.End)

How can I get this to work?

I think your State.Name == site.info$State.Name comparison may be the problem if 'site.info' has more than one row and the column is of class factor. You could try a join instead. This is easier with data.table, i.e. setDT(master.data)[site.info, on = .(State.Name, Date.Local >= Date.Start, Date.Local <= Date.End)] - akrun
I'll try this too. site.info has no more than 12 rows and the dates are formatted using as.Date so I'm curious as to why it doesn't work. - Gautam
Your State.Name column is probably a factor, with different levels in each data.frame. If you include dput(head(master.data)) in your question, we can know for certain. Unless you want this column as a factor, don't let it become one, e.g. use stringsAsFactors = FALSE in read.table - Richard Telford
@Richard You are correct, State.Name is indeed a factor. The output is too long to print here. The other columns are int or num, except for Date.Local, which is a Date - Gautam
Tried using both methods but it doesn't work. Any other suggestions? - Gautam
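
The join suggested in the comments above can also be sketched in base R. This is only an illustration under assumptions: the toy values below are made up, State.Name is kept as character (stringsAsFactors = FALSE), and site.info may have several rows, each with its own date window.

```r
# Toy stand-ins for master.data and site.info (values invented for illustration).
master.data <- data.frame(
  State.Name = c("Alabama", "Alabama", "Alaska"),
  Arithmetic.Mean = c(8.0, 7.0, 3.6),
  Date.Local = as.Date(c("2016-01-02", "2014-01-05", "2016-03-01")),
  stringsAsFactors = FALSE
)
site.info <- data.frame(
  State.Name = c("Alabama", "Alaska"),
  Date.Start = as.Date(c("28/12/2015", "01/01/2016"), format = "%d/%m/%Y"),
  Date.End   = as.Date(c("13/07/2016", "31/01/2016"), format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)

# Attach each site's date window to its rows by joining on State.Name ...
joined <- merge(master.data, site.info, by = "State.Name")

# ... then keep only the rows whose date falls inside that site's window.
in.window <- joined$Date.Local >= joined$Date.Start &
  joined$Date.Local <= joined$Date.End
reduced.data <- joined[in.window, ]
```

Because every row carries its own Date.Start and Date.End after the merge, the comparison is row by row and does not rely on recycling a multi-row site.info against master.data.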

1 Answer


After much trying, it seems that subset works better than filter in this case:

reduced.data <- subset(master.data, Latitude %in% closest.sites$Latitude
                       & Longitude %in% closest.sites$Longitude
                       & Date.Local >= site.info$Date.Start 
                       & Date.Local <= site.info$Date.End)

The above code gives me exactly the results I want.
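
A possible explanation, going by the comments on the question: this call no longer compares the two State.Name factors, so the mismatch in factor levels never comes up. The sketch below reproduces that mismatch on made-up data (the data frames a and b are hypothetical) and shows that comparing the values as character avoids it. stringsAsFactors = TRUE is set explicitly because recent R versions no longer create factors by default.

```r
# Two hypothetical data frames whose State.Name factors have different levels.
a <- data.frame(State.Name = c("Alabama", "Alaska"), stringsAsFactors = TRUE)
b <- data.frame(State.Name = "Alabama", stringsAsFactors = TRUE)

# Comparing the factors directly raises an error; capture its message instead
# of stopping the script.
msg <- tryCatch(
  a$State.Name == b$State.Name,
  error = function(e) conditionMessage(e)
)

# Comparing the underlying strings works regardless of the level sets.
keep <- as.character(a$State.Name) == as.character(b$State.Name)
matched <- a[keep, , drop = FALSE]
```

The same as.character() conversion could presumably be used inside filter, or the column could be read in as character in the first place, as suggested in the comments.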

Now I am stuck trying to group entries with the same time stamp: I want to take all samples recorded on the same day and find their average. Both subset and filter seem to fail in this case. R is woe.
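
One way to get a per-day average is to group rather than filter. A minimal base R sketch with aggregate, using made-up values and the column names from the question (the samples data frame is hypothetical):

```r
# Made-up samples: two readings on 2014-01-02 and one on 2014-01-05.
samples <- data.frame(
  Date.Local = as.Date(c("2014-01-02", "2014-01-02", "2014-01-05")),
  Arithmetic.Mean = c(8.0, 6.0, 7.0)
)

# Group rows by Date.Local and take the mean of Arithmetic.Mean in each group.
daily.mean <- aggregate(Arithmetic.Mean ~ Date.Local, data = samples, FUN = mean)
```

daily.mean then has one row per distinct date. Further grouping columns, such as State.Name, can be added on the right-hand side of the formula if the average should be taken per site and day.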