Using sum should help: applied to a logical vector, it treats each TRUE as 1 and each FALSE as 0, so you can count the non-NA values directly:
dat %>%
  group_by(s1, s2) %>%
  summarise(total_lines = n(),
            non_na_line = sum(!is.na(v2)))
# A tibble: 2 x 4
# Groups: s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     b           3           2
2     c     d           3           1
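To see the logical-to-integer coercion in isolation, here's a quick illustration (toy vector, not from your data):

x <- c(1, NA, 3, NA)
sum(!is.na(x))   # each TRUE counts as 1, each FALSE as 0
# [1] 2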
You'll easily be able to add a filter between group_by and summarise to get what you want; see the sketch just below. Keep in mind that summarise will only retain the columns you group by.
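A minimal sketch, assuming the condition in your question is something like v1 >= 10 (swap in your actual filter):

dat %>%
  group_by(s1, s2) %>%
  filter(v1 >= 10) %>%
  summarise(total_lines = n(),
            non_na_line = sum(!is.na(v2)))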
Benchmark
For what it's worth, I ran a quick benchmark, with some test data of similar size as yours.
library(dplyr)

# two grouping columns drawn from a-j and k-t, shuffled
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)
# v1: uniform numeric; v2: integers with a random number of NAs mixed in
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000, 1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)
# data_frame() has since been superseded by tibble(), but behaves the same here
df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)
This should roughly replicate the size and type of the data you suggest:
df
# A tibble: 30,000,000 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     s  9.2123603     7
 2     b     q 16.6638639    11
 3     g     o 18.3682028    11
 4     g     s  0.8779067     9
 5     a     s  0.0719127    10
 6     b     q 16.8809193    12
 7     h     q 15.4382455     6
 8     e     k  2.3565489    11
 9     h     p 16.4508811     9
10     d     n  2.7283823    11
# ... with 29,999,990 more rows
df %>%
  filter(is.na(v2))
# A tibble: 116,924 x 4
      s1    s2         v1    v2
   <chr> <chr>      <dbl> <int>
 1     d     r 13.1448988    NA
 2     b     o  0.2703848    NA
 3     b     t 18.8319385    NA
 4     a     s 11.6448437    NA
 5     j     m  0.5388760    NA
 6     i     k  8.7098427    NA
 7     d     s  6.1149735    NA
 8     h     p  2.5552694    NA
 9     g     r  0.9057442    NA
10     b     s 19.8886830    NA
# ... with 116,914 more rows
Now, let's benchmark dplyr operations vs data.table:
### dplyr
df %>%
  filter(v1 > 10) %>%
  group_by(s1, s2) %>%
  summarise(total_lines = n(),
            non_na_line = sum(!is.na(v2)))
# A tibble: 100 x 4
# Groups: s1 [?]
     s1    s2 total_lines non_na_line
  <chr> <chr>       <int>       <int>
1     a     k      150327      149734
2     a     l      149655      149062
3     a     m      149794      149200
4     a     n      149771      149197
5     a     o      149495      148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed
  1.848   0.420   2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
   user  system elapsed
187.657  55.878 245.528
### data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0]
   s1 s2      N    i.N
1:  b  q 149968 149348
2:  g  o 150411 149831
3:  h  q 150132 149563
4:  h  p 150786 150224
5:  e  o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed
  2.027   0.228   2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)], on = c("s1", "s2"), nomatch = 0])
   user  system elapsed
213.281  43.949 261.664
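As an aside, the same two counts can also be computed in a single data.table pass, without the self-join; a sketch (not benchmarked here):

dat[v1 > 10,
    .(total_lines = .N, non_na_line = sum(!is.na(v2))),
    by = .(s1, s2)]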
TL;DR: dplyr and data.table are similarly fast here; if anything, dplyr comes out slightly ahead.
Comments:
dat %>% group_by(s1, s2) %>% count() is a simple dplyr answer, but I don't know how performant it is. - Jonathan Carroll
non_na_line ... >= 10 in my OP. - neversaint
dat %>% group_by(s1, s2) %>% filter(!is.na(v2)) %>% count(). - Jonathan Carroll