Summarize function for dplyr doesn't output correct results by row for multiple columns

Question

I have a dataset with five numeric columns, rachis1 to rachis5, and 100 rows, each labelled with a name stored as a factor. I want summary statistics for each row across all five columns.

head(rl)
     name rachis1 rachis2 rachis3 rachis4 rachis5
1 R04-001     2.4     2.6     2.7     3.0     2.4
2 R04-002     7.0     7.4     7.7     6.8     7.4
3 R04-003     3.5     3.7     3.9     4.1     3.8
4 R04-004     9.5     9.1     7.8     8.8     8.2
5 R04-005     3.0     3.3     3.4     3.8     3.3
6 R04-006     9.2     9.8     9.5     9.4    10.1

My code for this is:

library(dplyr)
####Rachis
RL<- rl %>%
  group_by(name) %>% 
  summarize(RL= mean(rachis1:rachis5), RLMAX = max(rachis1:rachis5),RLMIN = 
  min(rachis1:rachis5), RLSTD=sd(rachis1:rachis5),na.rm=T)
head(RL)
tail(RL)

My resulting analysis comes out as...

head(RL)
# A tibble: 6 x 6
     name    RL RLMAX RLMIN     RLSTD na.rm
   <fctr> <dbl> <dbl> <dbl>     <dbl> <lgl>
1 R04-001   2.4   2.4   2.4        NA  TRUE
2 R04-002   7.0   7.0   7.0        NA  TRUE
3 R04-003   3.5   3.5   3.5        NA  TRUE
4 R04-004   9.0   9.5   8.5 0.7071068  TRUE
5 R04-005   3.0   3.0   3.0        NA  TRUE
6 R04-006   9.2   9.2   9.2        NA  TRUE

I was wondering why there is NA in the RLSTD (standard deviation) column and why the min and max are not the min and max of the row. Is there another way to get these descriptive statistics?

Can you show what your data looks like at the start? My guess is your problem is your use of rachis1:rachis5, which will be a sequence in steps of 1 from the rachis1 value toward the rachis5 value, not a selection of columns. So if rachis1 is 4 and rachis5 is 6, then rachis1:rachis5 will be 4, 5, 6: the mean is 5, the min is 4 and the max is 6. Probably you should put your data in long format first... hard to know without seeing your data. See here for tips on making reproducible examples; using dput() to share data is very nice because it is copy/pasteable. — Gregor Thomas
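To make the comment's point concrete, here is a small base R sketch of what `:` does with two of the rows shown in the question. The variable names `s` and `s1` are just for illustration. It appears to reproduce the RL, RLMAX, RLMIN and RLSTD values in the question's output for R04-004 and R04-001.

```r
# R04-004 from the question: rachis1 = 9.5, rachis5 = 8.2
rachis1 <- 9.5
rachis5 <- 8.2

# `:` steps by 1 from the left value toward the right value and stops
# before passing it, so this is a two-element vector: 9.5, 8.5
s <- rachis1:rachis5
print(s)

# Statistics of that sequence, not of the five rachis columns
print(c(mean = mean(s), max = max(s), min = min(s), sd = sd(s)))

# R04-001: rachis1 = 2.4 and rachis5 = 2.4, so the sequence has one element
s1 <- 2.4:2.4
print(s1)
print(sd(s1))  # the standard deviation of a single value is NA
```

That would explain why most rows show the rachis1 value in every column and NA for the standard deviation: the sequence has only one element whenever rachis5 is less than 1 above rachis1.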

Justin Justin · Accepted Answer · 2017-08-01T23:54:37

I can't tell whether any names are duplicated among the 100 rows. Since you already have the data in this format and are using the tidyverse, this should work either way: c() pools the five columns into one vector per name, so each statistic is computed over all of that name's values. Notice I have placed the na.rm argument within the individual statistic function calls.

 RL <- rl %>%
   group_by(name) %>%
   summarise(RL    = mean(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
             RLMAX = max(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
             RLMIN = min(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
             RLSTD = sd(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE))
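For comparison, here is a minimal sketch of the long-format route suggested in the comment above. It assumes tidyr 1.0.0 or later for pivot_longer() and rebuilds only the six rows printed by head(rl). The column names `measurement` and `value` are made up for this example.

```r
library(dplyr)
library(tidyr)

# The six rows shown by head(rl) in the question
rl <- data.frame(
  name    = factor(c("R04-001", "R04-002", "R04-003",
                     "R04-004", "R04-005", "R04-006")),
  rachis1 = c(2.4, 7.0, 3.5, 9.5, 3.0, 9.2),
  rachis2 = c(2.6, 7.4, 3.7, 9.1, 3.3, 9.8),
  rachis3 = c(2.7, 7.7, 3.9, 7.8, 3.4, 9.5),
  rachis4 = c(3.0, 6.8, 4.1, 8.8, 3.8, 9.4),
  rachis5 = c(2.4, 7.4, 3.8, 8.2, 3.3, 10.1)
)

RL <- rl %>%
  # One row per (name, measurement) pair instead of five columns per name;
  # inside pivot_longer(), rachis1:rachis5 is a column selection
  pivot_longer(rachis1:rachis5,
               names_to = "measurement", values_to = "value") %>%
  group_by(name) %>%
  summarise(
    RL    = mean(value, na.rm = TRUE),
    RLMAX = max(value, na.rm = TRUE),
    RLMIN = min(value, na.rm = TRUE),
    RLSTD = sd(value, na.rm = TRUE)
  )

print(RL)
```

Once the data is long, each statistic sees all five measurements for a name as a single vector, so duplicated names are pooled automatically.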
