
Background

I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.

Problem

I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I try to apply it, the data that I want to extract does not seem to group by subject as it is supposed to, but rather extracts data based on the entire dataset.

Code

This is what I have done so far:

data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")

library(dplyr)
library(plyr)

data <- tbl_df(data)

test <- data %>%
  filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
  ungroup()

Sample of the test dataframe:

[Image: test dataframe]

Find a .csv sample of my dataset here: https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing

Remove plyr from your workspace and load only dplyr, as there are a lot of conflicts between them. - A. Suliman
Or load plyr then dplyr, in that order. - Jake Kaupp
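
The conflict described in the comments can also be sidestepped by calling the dplyr verbs with an explicit namespace, so that the result no longer depends on whether plyr is attached. A minimal sketch, assuming a small made-up data frame (the name toy and its values are hypothetical, not taken from the linked file):

```r
library(dplyr)

# Hypothetical toy data standing in for the trial dataset
toy <- data.frame(
  subject_id = c(1, 1, 2, 2, 2),
  wght       = c(70, 72, NA, 85, 81)
)

# Call the dplyr verbs with an explicit namespace; plyr also exports
# mutate() and summarise(), and whichever package is attached last wins
per_row <- toy %>%
  dplyr::filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  dplyr::mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  dplyr::ungroup()

print(per_row)
```

Because mutate() keeps every row, each remaining observation carries its own subject's maximum and mean rather than the values for the whole dataset.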

1 Answer


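Is this what you want? In my example below, the output has one row per subject id, with the max of wght in the maxwght column and the mean of wght in the meanwght column. The data are grouped by subject_id before the statistics are computed, and only dplyr is loaded, so the plyr versions of mutate() and summarise() cannot mask the dplyr ones.

library(dplyr)  # load only dplyr; plyr attached afterwards masks mutate() and summarise()

data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")

test <- data %>%
    filter(!is.na(wght)) %>%        # drop rows with a missing weight
    group_by(subject_id) %>%        # group first, so the statistics are per subject
    summarise(maxwght=max(wght), meanwght=mean(wght)) %>%
    ungroup()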
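
One way to check whether the grouping took effect is to run the same kind of pipeline on a tiny data frame whose per-subject results are known in advance. A rough sketch, assuming made-up values (toy and per_subject are hypothetical names used only for illustration):

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  subject_id = c("A", "A", "B", "B"),
  wght       = c(60, 64, 90, NA)
)

per_subject <- toy %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>%
  summarise(maxwght = max(wght), meanwght = mean(wght))

# If grouping worked, the maxima differ between subjects (64 for A, 90 for B);
# the same value in every row would point to a whole-dataset calculation
print(per_subject)
stopifnot(n_distinct(per_subject$maxwght) > 1)
```

If the check fails on data like this, the verbs being called are probably not the dplyr ones, which is worth ruling out before looking for problems in the dataset itself.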