I know the title is kind of lame but I couldn't think of anything else to call this. I'm trying to subset a large data frame using the values that appear in the lon (longitude) column.

The subsetting script I currently have works: it creates a subset any time a -180 (the n/a value) appears, and each subset includes the first non -180 value before and after a run of one or more -180s. My problem is that I would like each subset to consist of the 20 longitudes before the -180s and the 20 after. Since many of my files start with -180s and end with -180s, this creates an error.

I just have no idea how to tell the script to subset around the -180s but ignore any that appear in the first or last rows. Ideally the script would only subset -180s that have 20 longitudes before and 20 longitudes after them. I will also never know how many -180s might appear at the start and end of a file, which has been the biggest obstacle to figuring this out for myself. Below is a sample of my data and my current subsetting code. Thank you in advance for your help!

Edit: It's also very important that the rows stay in the same order and are not sorted in any way, as this is chronological data. My data frame has 4461 rows and 7 columns.

Edit: below is a small sample of my data frame.
cols <- structure(list(fixType = structure(c(39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L), .Label = c("firstfix +indoors +startpoint", "firstfix +indoors +startpoint +cluster_center", "firstfix +indoors +stationary", "firstfix +indoors +stationary +cluster_center", "firstfix +invehicle +startpoint", "firstfix +invehicle +startpoint +cluster_center", "firstfix +invehicle +stationary +cluster_center", "firstfix +outdoors +startpoint", "firstfix +outdoors +startpoint +cluster_center", "firstfix +outdoors +stationary", "firstfix +outdoors +stationary +cluster_center", "inserted +indoors +midpoint", "inserted +indoors +pausepoint", "inserted +indoors +stationary", "inserted +indoors +stationary +cluster_center", "inserted +invehicle +midpoint", "inserted +invehicle +pausepoint", "inserted +invehicle +stationary", "inserted +invehicle +stationary +cluster_center", "inserted +outdoors +midpoint", "inserted +outdoors +pausepoint", "inserted +outdoors +stationary", "inserted +outdoors +stationary +cluster_center", "lastfix +indoors +endpoint", "lastfix +indoors +endpoint +cluster_center", "lastfix +indoors +stationary", "lastfix +indoors +stationary +cluster_center", "lastfix +invehicle +endpoint", "lastfix +invehicle +endpoint +cluster_center", "lastfix +outdoors +endpoint", "lastfix +outdoors +endpoint +cluster_center", "lastfix +outdoors +stationary", "lastvalidfix +indoors +stationary", "lastvalidfix +indoors +stationary +cluster_center", "lastvalidfix +invehicle +stationary", "lastvalidfix +invehicle +stationary +cluster_center", "lastvalidfix +outdoors +stationary", "lastvalidfix +outdoors +stationary +cluster_center", "unknown", "valid +indoors +endpoint", "valid +indoors +endpoint +cluster_center", "valid +indoors +midpoint", "valid +indoors +pausepoint", "valid +indoors +pausepoint +cluster_center", "valid +indoors +startpoint", "valid +indoors +startpoint +cluster_center", "valid +indoors +stationary", "valid +indoors +stationary +cluster_center", "valid 
+invehicle +endpoint", "valid +invehicle +endpoint +cluster_center", "valid +invehicle +midpoint", "valid +invehicle +pausepoint", "valid +invehicle +startpoint", "valid +invehicle +startpoint +cluster_center", "valid +invehicle +stationary", "valid +invehicle +stationary +cluster_center", "valid +outdoors +endpoint", "valid +outdoors +endpoint +cluster_center", "valid +outdoors +midpoint", "valid +outdoors +pausepoint", "valid +outdoors +pausepoint +cluster_center", "valid +outdoors +startpoint", "valid +outdoors +startpoint +cluster_center", "valid +outdoors +stationary", "valid +outdoors +stationary +cluster_center"), class = "factor"), lon = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180), lat = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180), activityIntensity = c(2L, 2L, 1L, 2L, 2L, 2L, 0L, 2L, 1L, 0L), Impute = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4352L, 4353L, 4354L), subsetNum = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum"), row.names = c(4462L, 4463L, 4464L, 4465L, 4466L, 4467L, 4468L, 8813L, 8814L, 8815L), class = "data.frame")
Subset code:
set.seed(5)
n <- nrow(df) # number of rows in the input file (length(df) would give the number of columns)
impCols <- df[ , c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum")]
test.df <- data.frame(impCols)
df <- test.df
obs <- dim(df)[1]
counter <- 1
subM.List <- list()
start.idx <- NA # NA means we are not currently inside a run of -180s
for (i in 1:obs) {
  if (is.na(start.idx) & (substr(df[i, "lon"], 1, 4) == -180)) {
    # a run of -180s starts here; keep the row just before it
    # (note: i - 1 is 0 when the file starts with -180)
    start.idx <- i - 1
  }
  else if (!is.na(start.idx) & (substr(df[i, "lon"], 1, 4) != -180)) {
    # the run has ended
    end.idx <- i + 1 # the plus one will give you the first two instances of signal instead of just the first
    subMat <- df[start.idx:end.idx, ]
    subM.List[[counter]] <- subMat
    start.idx <- NA
    counter <- counter + 1
  }
}
max(i, 20) and min(i, nrow(data)) to start/end the selection of your 20 rows. – Ellis Valentiner

diff(lon==180) is your friend. If you could kindly post a reproducible example specifically for the df data, that would be very helpful. – Ricardo Saporta

lon has neither 7 nor 4461 values. Is it a partial column from your dataset? In any case, either Ricardo's diff or similarly which(df$lon !=180) will get you started. Once you know the row indices of the transition from "180" to "not 180", it's easy to add or subtract 20 from those indices to get the subset you want. – Carl Witthoft
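The idea in the comments (find the row indices where lon switches between -180 and a real value, then add or subtract 20) could be sketched roughly as below. This is only a sketch under some assumptions: it uses rle() instead of diff() to locate the runs of -180, and it reads "has 20 longitudes before and after" as "the run of valid values on each side is at least 20 rows long". The names subset_gaps, pad, and missing_value are made up for illustration, and the demo data is invented rather than taken from the question.

```r
# Hypothetical helper: return one data frame per run of missing longitudes,
# padded with `pad` rows of valid data on each side. Rows keep their order.
subset_gaps <- function(df, pad = 20, missing_value = -180) {
  is_missing <- df$lon == missing_value
  runs <- rle(is_missing)                  # alternating runs of TRUE / FALSE
  run_end <- cumsum(runs$lengths)          # last row of each run
  run_start <- run_end - runs$lengths + 1  # first row of each run
  n_runs <- length(runs$lengths)

  # Length of the run before / after each run (0 when there is no neighbour),
  # so runs of -180 at the very start or end of the file are never kept.
  prev_len <- c(0, runs$lengths[-n_runs])
  next_len <- c(runs$lengths[-1], 0)

  keep <- which(runs$values & prev_len >= pad & next_len >= pad)

  lapply(keep, function(k) {
    df[(run_start[k] - pad):(run_end[k] + pad), , drop = FALSE]
  })
}

# Invented demo data: -180s at the start, in the middle, and at the end
lon <- c(rep(-180, 3), seq(10, by = 0.1, length.out = 25), rep(-180, 2),
         seq(20, by = 0.1, length.out = 30), rep(-180, 4))
demo <- data.frame(ID = seq_along(lon), lon = lon)

gaps <- subset_gaps(demo)
length(gaps)         # 1: only the middle run has 20 valid rows on both sides
range(gaps[[1]]$ID)  # 9 50
```

Because each subset is a contiguous range of row indices, the chronological order is untouched. If a looser rule is wanted (for example, taking whatever rows exist when fewer than 20 are available), the max()/min() clamping from the first comment could replace the prev_len/next_len check.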