I have an example data.frame below. The original dataset has roughly one million rows and 25 columns. I have a solution using a for loop (sketched further down), but it is just too slow. So I was wondering: is there a way to do what I want with dplyr or data.table?
Example data.frame:
outcome leg arm head
1 0 1 0
1 1 0 1
1 0 1 1
0 1 0 0
0 0 1 0
0 0 1 1
1 1 0 0
0 1 0 1
1 0 0 0
0 0 0 0
0 1 0 0
1 0 1 0
0 1 0 0
0 0 0 1
1 1 1 0
0 1 1 0
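For reproducibility, the example above can be built like this (the object name df is just my choice; the columns follow the table above):

    # Recreate the example data.frame shown above
    df <- data.frame(
      outcome = c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0),
      leg     = c(0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1),
      arm     = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1),
      head    = c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0)
    )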
Each row is an action. The outcome column records whether that action was successful or not, and each of the other columns records whether the action was with a leg, arm, or head.
What I need is a new column giving, for each row, the percentage of successful actions among all actions of that same type. For example, the first row is a successful action with just the arm, so I count every action in the entire dataset that was with just the arm and calculate the percentage of those that were successful, and put that value in the new column on that same row. There are 3 actions (rows) with just the arm in the dataset and 2 of them were successful (outcome = 1), so the new value on that row would be 2/3 = 0.666, and the row would look like the following:
outcome leg arm head successful
1 0 1 0 0.666
We would then move on to the second row, which is an action with the leg (leg = 1) and head (head = 1), so we count all actions with the leg and head in the dataset and get the percentage of those that were successful, and so on for every row.
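For reference, my current approach is a row-by-row loop along these lines (a simplified sketch, not my exact code), which is far too slow at a million rows:

    # Naive row-by-row loop (sketch): for each row, find every row with the
    # same leg/arm/head combination and take the mean of their outcome
    df$successful <- NA_real_
    for (i in seq_len(nrow(df))) {
      same <- df$leg == df$leg[i] & df$arm == df$arm[i] & df$head == df$head[i]
      df$successful[i] <- mean(df$outcome[same])
    }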
Is there a way to achieve this for the entire dataset with either data.table or dplyr? I presume data.table would be faster?