0
votes

I am looking for a way to omit the rows which are not between two specific values, without using for loop. All rows in year column are between 1999 and 2002, however some of them do not include all years between these two dates. You can see the initial data as follows:

a <- data.frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001), 
                id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))

   year id
1  2000  4
2  2001  6
3  2002  2
4  1999  1
5  2000  3
6  2001  5
7  2002  7
8  1999  4
9  2000  2
10 2001  0
11 2002 -1
12 1999 -3
13 2000  4
14 2001  3

Processed dataset should only include consecutive rows between 1999:2002. The following data.frame is exactly what I need:

  year id
1 1999  1
2 2000  3
3 2001  5
4 2002  7
5 1999  4
6 2000  2
7 2001  0
8 2002 -1

When I execute the following for loop, I get previous data.frame without any problem:

for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
  if(a[i,1] == 1999 & a[i+3,1] == 2002){
    b <- a[i:(i+3),]
  }else{next}

  if(!exists("d")){
    d <- b
  }else{
    d <- rbind(d,b)
  }
}

However, I have more than 1 million rows and I need to do this process without using for loop. Is there any faster way for that?

3
Are there other columns in the data that ensure the row order? It seems like your data happen to be in order based on some type of grouping and that that grouping is important.hmhensen

3 Answers

0
votes

There is a function that can do this automatically.

First, install the package called dplyr or tidyverse with command install.packages("dplyr") or install.packages("tidyverse").

Then, load the package with library(dplyr).

Then, use the filter function: a_filtered = filter(a, year >=1999 & year < 2002).

This should be fast even there are many rows.

0
votes

You could try this. First we create groups of consecutive numbers, then we join with the full date range, then we filter if any group is not full. If you already have a grouping variable, this can be cut down a lot.

library(tidyverse)

df <- data_frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001), 
                id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))

df %>% 
  mutate(groups = cumsum(c(0,diff(year)!=1))) %>% 
  nest(-groups) %>%
  mutate(data = map(data, .f = ~full_join(.x, data_frame(year = 1999:2002), by = "year")),
         drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>% 
  unnest() %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#>    year    id
#>   <int> <dbl>
#> 1  1999     1
#> 2  2000     3
#> 3  2001     5
#> 4  2002     7
#> 5  1999     4
#> 6  2000     2
#> 7  2001     0
#> 8  2002    -1

Created on 2018-08-31 by the reprex package (v0.2.0).

0
votes

We could also do this by creating a grouping column based on the logical expression checking the 'year' 1999, then filter by checking the first 'year' as '1999', last as '2002' and if all the 'year' in between are present for the particular 'grp'

library(dplyr)
a %>% 
  group_by(grp = cumsum(year == 1999)) %>%
  filter(dplyr::first(year) == 1999, 
         dplyr::last(year) == 2002, 
         all(1999:2002 %in% year)) %>%
  ungroup %>% # in case to remove the 'grp'
  select(-grp)
# A tibble: 8 x 2
#   year    id
#  <int> <dbl>
#1  1999     1
#2  2000     3
#3  2001     5
#4  2002     7
#5  1999     4
#6  2000     2
#7  2001     0
#8  2002    -1