3
votes

I am new to R and am trying to collapse rows based on row values with dplyr. The following example shows the sample data.

set.seed(123)

df<-data.frame(A=c(rep(1:4,4)),
               B=runif(16,min=0,max=1),
               C=rnorm(16, mean=1,sd=0.5))

   A B          c
1  1 0.36647435 0.7485365
2  2 0.51864614 0.8654337
3  3 0.04596929 0.9858012
4  4 0.15479619 1.1294208
5  1 0.76712372 1.2460700
6  2 0.17666676 0.7402996
7  3 0.89759874 1.2699954
8  4 0.90267735 0.7101804
9  1 0.91744223 0.3451281
10 2 0.25472599 0.8604743
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1 0.21157810 1.3170205
14 2 0.14947268 1.2789700
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853

I want to summarize/collapse two rows based on the condition that the rows in column A with values 1 and 2 as one row (as mean of row 1 and 2) . Therefore the final result will have only 12 rows because the other 4 rows has been collapsed.

I tried to use the following dplyr function but to little avail.

install.packages ("tidyverse") library (tidyverse)

df %>% summarize_each( fun(i){ for i %in% c(1,2)funs(mean) })

The expected output is something like:

   A    B           C
1  1.5  0.4425602   0.8069851
3  3    0.04596929  0.9858012
4  4    0.15479619  1.1294208
5  1.5  0.4718952   0.9931848
7  3    0.89759874  1.2699954
8  4    0.90267735  0.7101804
9  1.5  0.5860841   0.6028012
11 3    0.10933985  0.8696796
12 4    0.71656017  1.2648846
13 1.5  0.1805254   1.297995
15 3    0.92251060  1.5696901
16 4    0.30090579  1.7642853

Thank you in advance.

2
the rows are not completely collapsed, they're collapsed 2 by 2 and you don't explain the rule, in your real data you really have sequences of 4 rows after another like this ?Moody_Mudskipper
@Moody_Mudskipper the rows are to be collapsed based if the values in column A is 1 and 2.G1124E
When generating random data, it would be helpful to include a set.seed() call in your example. That way others can replicate your results exactly!Mikko Marttila
I had read this, but it seems to work by groups of 4, which is the assumption that the chosen answer took, but not explicit in your questionMoody_Mudskipper

2 Answers

2
votes

By making the implicit, order based groupings explicit, the summary can be done with a single summarise_all call.

# Generate the data
set.seed(1)

df <- data.frame(
  A = c(rep(1:4, 4)), 
  B = runif(16, min = 0, max = 1), 
  C = rnorm(16, mean = 1, sd = 0.5)
)

library(dplyr)

new <- df %>%
  group_by(grp = rep(
    1:4,      # vector containing names of groups to create
    each = 4  # number of elements in each group
  )) %>% 
  group_by(mean_grp = cumsum(A > 2) + 1, add = T) %>%
  summarise_all(mean) %>%
  ungroup()

new
#> # A tibble: 12 x 5
#>      grp mean_grp     A         B           C
#>    <int>    <dbl> <dbl>     <dbl>       <dbl>
#>  1     1        1   1.5 0.3188163 1.067598241
#>  2     1        2   3.0 0.5728534 1.755890584
#>  3     1        3   4.0 0.9082078 1.194921618
#>  4     2        1   1.5 0.5500358 0.291014883
#>  5     2        2   3.0 0.9446753 1.562465459
#>  6     2        3   4.0 0.6607978 0.977533195
#>  7     3        1   1.5 0.3454502 1.231911487
#>  8     3        2   3.0 0.2059746 1.410610598
#>  9     3        3   4.0 0.1765568 1.296950661
#> 10     4        1   1.5 0.5355633 1.425278418
#> 11     4        2   3.0 0.7698414 1.037282492
#> 12     4        3   4.0 0.4976992 0.005324152

I would recommend keeping the grouping variables in your data after the summary (everything is simpler if you include them in the first place), but if you want to, you can drop them with new %>% select(-grp, -mean_grp).


PS. In order to avoid having "magic numbers" (such as the 1:4 and each = 4 when creating grp) included in the code, you could also create the first grouping variable as:

grp = cumsum(A < lag(A, default = A[1])) + 1

Assuming that the original data are ordered such that a new group starts each time the value of A is less than the previous value of A.

1
votes

One option would be to process the rows with A equal to 1 or 2 separately from the other rows and then bind them back together:

set.seed(3)
df<-data.frame(A=c(rep(1:4,4)),B=runif(16,min=0,max=1),c=rnorm(16, mean=1,sd=0.5))

df %>% 
  filter(A %in% 1:2) %>% 
  group_by(tmp=cumsum(A==1)) %>% 
  summarise_all(mean) %>% 
  ungroup %>% select(-tmp) %>% 
  bind_rows(df %>% filter(!A %in% 1:2))
       A         B         c
   <dbl>     <dbl>     <dbl>
 1   1.5 0.4877790 1.0121278
 2   1.5 0.6032474 0.8840735
 3   1.5 0.6042946 0.5996850
 4   1.5 0.5456424 0.6198039
 5   3.0 0.3849424 0.6276092
 6   4.0 0.3277343 0.4343907
 7   3.0 0.1246334 1.0760229
 8   4.0 0.2946009 0.8461718
 9   3.0 0.5120159 1.6121568
10   4.0 0.5050239 1.0999058
11   3.0 0.8679195 0.8981359
12   4.0 0.8297087 0.1667626