0
votes

I am trying to reorganize some data from long to wide format.

There are many individuals (MRNs), each of whom had sequencing done a different number of times (seq_date) and I want to create a data frame that shows the change in the Val over time.

The data frame I'm starting with looks like this:

dat_data <- data.frame(
  MRN = c("012345", "012345", "012345", "012345", "012345", "012345"),
  seq_date = c("1-Aug-18", "27-Mar-19", "27-Mar-19", "27-Mar-19", "7-May-19", "7-May-19"),
  Gene = c("SRSF2", "TET2", "IDH1", "SRSF2", "IDH1", "SRSF2"),
  AA = c("p.A2B", "p.C2D", "p.E2F", "p.A2B", "p.E2F", "p.A2B"),
  Val = c("0.1", "0.2", "0.3", "0.4", "0.5", "0.6")
)

> dat_data
     MRN  seq_date  Gene    AA Val
1 012345  1-Aug-18 SRSF2 p.A2B 0.1
2 012345 27-Mar-19  TET2 p.C2D 0.2
3 012345 27-Mar-19  IDH1 p.E2F 0.3
4 012345 27-Mar-19 SRSF2 p.A2B 0.4
5 012345  7-May-19  IDH1 p.E2F 0.5
6 012345  7-May-19 SRSF2 p.A2B 0.6

I'm trying to figure out how to add a column containing D1, D2, D3, etc. based on when the samples were collected:

     MRN  seq_date  Gene    AA Val ord_date
1 012345  1-Aug-18 SRSF2 p.A2B 0.1       D1
2 012345 27-Mar-19  TET2 p.C2D 0.2       D2
3 012345 27-Mar-19  IDH1 p.E2F 0.3       D2
4 012345 27-Mar-19 SRSF2 p.A2B 0.4       D2
5 012345  7-May-19  IDH1 p.E2F 0.5       D3
6 012345  7-May-19 SRSF2 p.A2B 0.6       D3

and subsequently use gather / spread to create a wide-format data frame that looks like this:

     MRN  Gene    AA  D1  D2  D3
1 012345 SRSF2 p.A2B 0.1 0.4 0.6
2 012345  IDH1 p.E2F   0 0.3 0.5
3 012345  TET2 p.C2D   0 0.2   0

I am most familiar w/ dplyr and have been trying mutate = case_when for the first step and gather / spread for the second, without success. Would greatly appreciate any help.

1

1 Answers

0
votes

I grouped by date to identify the ordering and then used pivot_wider from tidyr to spread.

dat_data %>%
  mutate(
    sdate = lubridate::dmy(seq_date), # in case dates aren't in order
    Val = as.numeric(as.character(Val)) # convert factor to numeric
  ) %>%
  group_by(sdate) %>%
  mutate(
    ord_date = paste0('D',group_indices()) # Creates D1, D2, etc
  ) %>%
  pivot_wider(
    id_cols = c(MRN,Gene,AA),
    names_from = ord_date,
    values_from = Val,
    values_fill = list(Val = 0) # fills missings with 0 instead of NA
  )

# A tibble: 3 x 6
  MRN    Gene  AA       D1    D2    D3
  <fct>  <fct> <fct> <dbl> <dbl> <dbl>
1 012345 SRSF2 p.A2B   0.1   0.4   0.6
2 012345 TET2  p.C2D   0     0.2   0  
3 012345 IDH1  p.E2F   0     0.3   0.5