0 votes

I have an example data.frame below. The original dataset has roughly one million rows and 25 columns. I have a solution with a for loop, but it's just too slow, so I was wondering: is there a way to do what I want with dplyr or data.table?

Example data.frame:

outcome leg arm head
      1   0   1    0
      1   1   0    1
      1   0   1    1
      0   1   0    0
      0   0   1    0
      0   0   1    1
      1   1   0    0
      0   1   0    1
      1   0   0    0
      0   0   0    0
      0   1   0    0
      1   0   1    0
      0   1   0    0
      0   0   0    1
      1   1   1    0
      0   1   1    0

Each row is an action. The outcome column is whether that action was successful or not, and each other column is whether that action was with a leg, arm, or head.

What I need is another column that gives the percentage of successful actions among all actions of that same type. For example, the first row is a successful action with the arm, so count all actions in the entire dataset that were with just the arm and work out the percentage that were successful, then put that value in a new column on that same row. There are 3 actions (rows) with just an arm in the dataset and 2 were successful (outcome = 1), so the new value on that row would be 0.666, and the row would look like the following:

outcome leg arm head successful
      1   0   1    0      0.666

We would then move on to the second row, which is an action with the leg (leg = 1) and head (head = 1), so we would count all actions with the leg and head in the dataset and get the percentage of those that were successful, and so on.

Is there a way to achieve this for the entire dataset with either data.table or dplyr? I would presume data.table would be faster?


3 Answers

3 votes

You can do it with data.table like this:

DT[, sum(outcome)/.N, by = .(leg, arm, head)]
#    leg arm head        V1
# 1:   0   1    0 0.6666667
# 2:   1   0    1 0.5000000
# 3:   0   1    1 0.5000000
# 4:   1   0    0 0.2500000
# 5:   0   0    0 0.5000000
# 6:   0   0    1 0.0000000
# 7:   1   1    0 0.5000000
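Since the question asks for the percentage on every original row rather than one row per group, a minimal sketch of that variant (assuming the example data is in a data.frame called df) adds the column by reference with `:=`:

    library(data.table)

    # convert the example data.frame (assumed to be called df) to a data.table
    DT <- as.data.table(df)

    # add the per-group success rate onto every row; := modifies DT in place
    # mean(outcome) equals sum(outcome)/.N because outcome is 0/1
    DT[, successful := mean(outcome), by = .(leg, arm, head)]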

or with dplyr like this:

DT %>% group_by(leg, arm, head) %>% summarize(successful = sum(outcome)/n())

# # A tibble: 7 x 4
# # Groups:   leg, arm [?]
#     leg   arm  head successful
#   <int> <int> <int>      <dbl>
# 1     0     0     0      0.500
# 2     0     0     1      0.   
# 3     0     1     0      0.667
# 4     0     1     1      0.500
# 5     1     0     0      0.250
# 6     1     0     1      0.500
# 7     1     1     0      0.500
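On whether data.table is actually faster: a rough way to check on your own million-row data is a small benchmark with the microbenchmark package. This is only a sketch, assuming df holds the data as a data.frame and DT is the same data as a data.table:

    library(data.table)
    library(dplyr)
    library(microbenchmark)

    # compare the two grouped aggregations on the same data
    microbenchmark(
      data.table = DT[, mean(outcome), by = .(leg, arm, head)],
      dplyr      = df %>% group_by(leg, arm, head) %>% summarize(successful = mean(outcome)),
      times = 10
    )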

2 votes

Here's how to do it with dplyr:

df %>%
  group_by(arm, head, leg) %>%
  summarize_at(vars(outcome), funs(successful = mean)) %>%
  left_join(df, .)

The last line merges the summary into the original data set. I couldn't tell if that was what you were after or not, so if not, just skip the last line.
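Note that funs() was deprecated in dplyr 0.8, so on newer dplyr versions the same merge can be written with a plain summarize(). A sketch assuming the example data is in df:

    library(dplyr)

    df %>%
      group_by(arm, head, leg) %>%
      summarize(successful = mean(outcome), .groups = "drop") %>%
      left_join(df, ., by = c("arm", "head", "leg"))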

0 votes

Using dplyr:

data <- group_by_at(data, .vars = 2:4) %>% mutate(successful = sum(outcome) / n())

The .vars argument can be set to whatever positions your grouping columns are in, which is useful if you have more than 20 columns to include; see the sketch below.
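A minimal sketch of this applied to the example data, assuming it is stored in data with outcome in column 1 and the leg/arm/head columns in positions 2-4:

    library(dplyr)

    data <- data %>%
      group_by_at(.vars = 2:4) %>%                 # group by every column except outcome
      mutate(successful = sum(outcome) / n()) %>%  # per-group success rate on every row
      ungroup()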