1
votes

I have a dataset which looks somewhat like this-

Col1 Col2 Col3 Col4 Col5
400  322  345  1    1
131  345  809  1    1 
565  676  311  2    1
121  645  777  2    1
322  534  263  3    1
545  222  111  3    1

I want to perform a group-wise calculation where for each unique value in Col5, I calculate a statistic for Col1:Col3 grouping by Col4-

(X(i,j)-X'(i,j))/S(i)

where X(i,j) represents the mean of the variable for group i,j (Col5,Col4) ,X' represents the mean of the other groups j for the same variable, and S is the standard deviation over the entire group i. For example, in the above case, the statistic for Col1 based on group 1 in Col4 will be-

(mean(400,131)-mean(565,121,322,545))/stddev(Col1)
(265.5-388.25)/193.85 = -0.633

I want to use the summarise function with ddply to calculate this for each of the variables and for each of the groups in Col4 and Col5.

PS- I hope I've been able to explain the problem clearly.

Thanks!

2
@akrun I gave an example for one calculation, the output has to be the same calculation for all groups of Col4, and for variables Col1:Col3. I'm not able to calculate the output, the only way for me to show you would be a manual calculation.Mridul Garg
I get (mean(c(400, 131))-mean(c(565,121,322,545)))/sd(df1$Col1) #[1] -0.6332145 The sd for Col1 is sd(df1$Col1) #[1] 193.8522akrun
@akrun Oh, I'm new to stack-exchange and I didn't know about the accepting answer part. Thank you for letting me know, most of the questions I've asked have been answered. I'll make sure to do that now onwardsMridul Garg

2 Answers

1
votes

One option after grouping by 'Col4' and 'Col5' would be to get the sum of the whole column i.e. 'Col1' , subtract it from the sum of the elements of 'Col1' in each group, divide by the difference of the number of rows of the whole dataset with the group number of rows (n()) to get the mean of elements not in the group. Get the difference from the mean of the group and divide by the standard deviation of the entire column.

 library(dplyr)
 df1 %>% 
    group_by(Col4, Col5) %>% 
    summarise(ColN = (mean(Col1)-((sum(df1$Col1) - 
                   sum(Col1)))/(nrow(df1)-n()))/sd(df1$Col1) )

If we need for Col1:Col3, use summarise_each

 df1 %>% 
    group_by(Col4, Col5) %>%
    summarise_each(funs(((mean(.) - ((sum(df1$.)-sum(.))/(nrow(df1)- 
                      n())))/sd(df1$.))), Col1:Col3)
#  Col4  Col5       Col1       Col2       Col3
# <int> <int>      <dbl>      <dbl>      <dbl>
#1     1     1 -0.6332145 -0.9922312  0.7342422
#2     2     1 -0.0335307  1.6279003  0.5623983
#3     3     1  0.6667452 -0.6356690 -1.2966405
1
votes

Just for fun, here's an answer equivalent to @akrun's but without using dplyr:

a <- matrix(ncol=3, nrow=3)   
n <- 1
for(i in unique(df$Col5)){
 for(ii in unique(df$Col4[df$Col5 == i])){
  a[n,1] <- i
  a[n,2] <- ii
  a[n,3] <- (mean(df$Col1[df$Col4 == ii]) - mean(df$Col1[!df$Col4 == ii])) / sd(df$Col1)
  n      <- n + 1
 }
}

> a
     [,1] [,2]       [,3]
[1,]    1    1 -0.6332145
[2,]    1    2 -0.0335307
[3,]    1    3  0.6667452

This is for Col1. You can add the same thing for Col2 and Col3 with copy and paste.