57
votes

I'll illustrate my question with an example.

Sample data:

 df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B =     c(1, 5, 7, 23, 54, 202))

df
  ID   A   B
1  1 foo   1
2  1 bar   5
3  2 foo   7
4  2 foo  23
5  3 bar  54
6  5 bar 202

What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:

require(magrittr)
require(dplyr)

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))

df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))

left_join(df1, df2)

  ID sumB sumBfoo
1  1    6       1
2  2   30      30
3  3   54      NA
4  5  202      NA

However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.

require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)

I thought of using mutate to define a new Bfoo column:

df_sqlite %>%
  mutate(Bfoo = ifelse(A=="foo", B, 0))

Unfortunately, this doesn't work on the database end of things.

Error in sqliteExecStatement(conn, statement, ...) : 
  RS-DBI driver: (error in statement: no such function: IFELSE)
4
I believe the problem is that you are trying to mix character and numeric in Bfoo which is not possibletalat
@beginneR it's all numeric and the code runs fine locally...kevinykuo
What versions of R and dplry are you using? Your code ran fine for me locally as well (after I changed %>% to %.%).MrFlick
@MrFlick yeah it's supposed to run "locally" (by that I meant on a data.frame and not in the sqlite database). I'm trying to figure out how to best run it in sqlite, which doesn't recognize "ifelse"kevinykuo
Try using if(A=="foo") B else 0 - dplyr will try to convert that to a SQL case statement, which might work for you.hadley

4 Answers

107
votes

You can do both sums in a single dplyr statement:

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A=="foo"]))

And here is a data.table version:

library(data.table)

dt = setDT(df) 

dt1 = dt[ , .(sumB = sum(B),
              sumBfoo = sum(B[A=="foo"])), 
          by = ID]

dt1
   ID sumB sumBfoo
1:  1    6       1
2:  2   30      30
3:  3   54       0
4:  5  202       0
33
votes

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
14
votes

If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.

df1 <- df %>%
    group_by(ID) %>%
    summarize(countB = n(),
              countBfoo = sum(A=="foo"))

df1
Source: local data frame [4 x 3]

  ID countB countBfoo
1  1      2         1
2  2      2         2
3  3      1         0
4  5      1         0
-1
votes

If you wanted to count the rows, instead of summing them, can you pass a variable to the function:

    df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
          RowCountBfoo = n(A=="foo"))

I get an error both with n() and nrow().