0 votes

I have an example data.frame below. The original dataset has roughly one million rows and 25 columns. I have a solution with a for loop, but it's just too slow, so I was wondering: is there a way to do what I want with dplyr or data.table?

Example data.frame:

outcome leg arm head
      1   0   1    0
      1   1   0    1
      1   0   1    1
      0   1   0    0
      0   0   1    0
      0   0   1    1
      1   1   0    0
      0   1   0    1
      1   0   0    0
      0   0   0    0
      0   1   0    0
      1   0   1    0
      0   1   0    0
      0   0   0    1
      1   1   1    0
      0   1   1    0

Each row is an action. The outcome column is whether that action was successful or not, and each other column is whether that action was with a leg, arm, or head.

What I need is another column that gives the percentage of successful actions among all actions of that same type. For example, the first row is a successful action with the arm, so count all actions in the entire dataset that were with just the arm and work out the percentage that were successful, then put that value in a new column on that same row. There are 3 actions (rows) with just an arm in the dataset and 2 were successful (outcome = 1), so the new value on that row would be 0.666, and the row would look like the following:

outcome leg arm head successful
      1   0   1    0      0.666

We would then move on to the second row, which is an action with the leg (leg = 1) and head (head = 1), so we would count all actions with the leg and head in the dataset and get the percentage of those that were successful, and so on.

Is there a way to achieve this for the entire dataset with either data.table or dplyr? I would presume data.table would be faster?


3 Answers

3 votes

You can do it with data.table like this:

DT[, sum(outcome)/.N, by = .(leg, arm, head)]
#    leg arm head        V1
# 1:   0   1    0 0.6666667
# 2:   1   0    1 0.5000000
# 3:   0   1    1 0.5000000
# 4:   1   0    0 0.2500000
# 5:   0   0    0 0.5000000
# 6:   0   0    1 0.0000000
# 7:   1   1    0 0.5000000
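Since the question asks for the percentage on every original row rather than one row per group, a minimal sketch of that variant (assuming the example data is in a data.frame called df) adds the column by reference with `:=`:

    library(data.table)

    # convert the example data.frame (assumed to be called df) to a data.table
    DT <- as.data.table(df)

    # add the per-group success rate onto every row; := modifies DT in place
    # mean(outcome) equals sum(outcome)/.N because outcome is 0/1
    DT[, successful := mean(outcome), by = .(leg, arm, head)]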

or with dplyr like this:

DT %>% group_by(leg, arm, head) %>% summarize(successful = sum(outcome)/n())

# # A tibble: 7 x 4
# # Groups:   leg, arm [?]
#     leg   arm  head successful
#   <int> <int> <int>      <dbl>
# 1     0     0     0      0.500
# 2     0     0     1      0.   
# 3     0     1     0      0.667
# 4     0     1     1      0.500
# 5     1     0     0      0.250
# 6     1     0     1      0.500
# 7     1     1     0      0.500
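On whether data.table is actually faster: a rough way to check on your own million-row data is a small benchmark with the microbenchmark package. This is only a sketch, assuming df holds the data as a data.frame and DT is the same data as a data.table:

    library(data.table)
    library(dplyr)
    library(microbenchmark)

    # compare the two grouped aggregations on the same data
    microbenchmark(
      data.table = DT[, mean(outcome), by = .(leg, arm, head)],
      dplyr      = df %>% group_by(leg, arm, head) %>% summarize(successful = mean(outcome)),
      times = 10
    )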

2 votes

Here's how to do it with dplyr:

df %>%
  group_by(arm, head, leg) %>%
  summarize_at(vars(outcome), funs(successful = mean)) %>%
  left_join(df, .)

The last line merges the summary into the original data set. I couldn't tell if that was what you were after or not, so if not, just skip the last line.
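Note that funs() was deprecated in dplyr 0.8, so on newer dplyr versions the same merge can be written with a plain summarize(). A sketch assuming the example data is in df:

    library(dplyr)

    df %>%
      group_by(arm, head, leg) %>%
      summarize(successful = mean(outcome), .groups = "drop") %>%
      left_join(df, ., by = c("arm", "head", "leg"))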

0 votes

Using dplyr:

data <- group_by_at(data, .vars = 2:4) %>% mutate(successful = sum(outcome) / n())

The .vars argument can be set to whatever positions your grouping columns are in, which is useful if you have more than 20 columns to include; see the sketch below.
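A minimal sketch of this applied to the example data, assuming it is stored in data with outcome in column 1 and the leg/arm/head columns in positions 2-4:

    library(dplyr)

    data <- data %>%
      group_by_at(.vars = 2:4) %>%                 # group by every column except outcome
      mutate(successful = sum(outcome) / n()) %>%  # per-group success rate on every row
      ungroup()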