0
votes

enter image description here

Hi, I have this dataset with unique Id variables in column A and then subsequent renal scans for each patient. It is a csv file and I want to reshape it to long format using R if possible. Each participant can have a number of renal scans ranging from 1-17.

Also some Id are listed as 'No' for not receiving scan. I want it to be reshaped to something similar to this

enter image description here

I know previous questions on this organize by year, I have scans from participants that occur multiple times during the year date format yyyy-mm-dd

Please see data below

structure(list(id = c(1010001, 1010002, 1010004, 1010005, 1010006, 
1010007), `GFR Scans?` = c("Yes", "Yes", "Yes", "Yes", "Yes", 
"No"), `1. Date of renal scan:` = structure(c(1133913600, 1196812800, 
1237334400, 1124150400, 1192060800, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `1. Type of renal scan:` = c("DTPA", 
"DTPA", "DTPA", "DTPA", "DTPA", NA), `1. GFR mL/1.73 sq.m` = c(18, 
13, 68, 117, 46, NA), `1. Pre/Post tx?` = c("Pre", "Pre", "Post", 
"Post", "Pre", NA), `2. Date of renal scan:` = structure(c(1146528000, 
1214524800, NA, 1151366400, 1245974400, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `2. Type of renal scan:` = c("DTPA", 
"DTPA", NA, "DTPA", "DTPA", NA), `2. GFR mL/1.73 sq.m` = c(86, 
110, NA, 148, 123, NA), `2. Pre/Post tx?` = c("Post", "Post", 
NA, "Post", "Post", NA), `3. Date of renal scan:` = structure(c(NA, 
1219104000, NA, 1184025600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `3. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `3. GFR mL/1.73 sq.m` = c(NA, 92, NA, 166, NA, 
NA), `3. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `4. Date of    renal scan:` = structure(c(NA, 
1242691200, NA, 1213660800, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `4. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `4. GFR mL/1.73 sq.m` = c(NA, 36, NA, 171, NA, 
NA), `4. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `5. Date of    renal scan:` = structure(c(NA, 
NA, NA, 1288656000, NA, NA), class = c("POSIXct", "POSIXt"), tzone =  "UTC"), 
    `5. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `5. GFR mL/1.73 sq.m` = c(NA, NA, NA, 105, NA, NA), `5. Pre/Post  tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `6. Date of renal scan:` = structure(c(NA, 
    NA, NA, 1323129600, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = "UTC"), `6. Type of renal scan:` = c(NA, NA, NA, 
    "DTPA", NA, NA), `6. GFR mL/1.73 sq.m` = c(NA, NA, NA, 103, 
    NA, NA), `6. Pre/Post tx?` = c(NA, NA, NA, "Post", NA, NA
    ), `7. Date of renal scan:` = structure(c(NA, NA, NA, 1355184000, 
    NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `7. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `7. GFR mL/1.73 sq.m` = c(NA, NA, NA, 98, NA, NA), `7. Pre/Post tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `8. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `8. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `8. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `8. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `9. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `9. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `9. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `9. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `10. Date   of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `10. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `10. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `10. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `11. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `11. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `11. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `11. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `12. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `12. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `12. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `12. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA), `13. Date of renal scan:` = c(NA, NA, NA, NA, NA, 
    NA), `13. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA
    ), `13. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, NA), `13. Pre/Post tx?` = c(NA, 
    NA, NA, NA, NA, NA), `14. Date of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `14. Type of renal scan:` = c(NA, NA, NA, 
    NA, NA, NA), `14. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, 
    NA), `14. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `15. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `15. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `15. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `15. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `16. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `16. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `16. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `16. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `17. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `17. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `17. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `17. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

The first image is the original wide format, second image is what I am trying to get. None of the other wide to long answers on this have helped me as I have multiple columns involved.

e.g. id 1010001 has had two scans, I need this listed one after each other and not beside each other (see image two).

I would really appreciate your help.

2
So the idea is to sort the table into ID and shift the second and third groupings into the first?bob1
Yes, Group by ID and then list the subsequent scans underneath instead of side by side. Some of the ID have up to 17 scans (columns to the side).user8722680
Also some IDs didn't receive any scans - listed as No. These also need to be listed, with only one row for these as the have no subsequent linked columnsuser8722680

2 Answers

2
votes

This question has been asked several times before, e.g., Reshaping multiple sets of measurement columns (wide format) into single columns (long format). One possible approach is to use data.table's melt() function which is able to reshape multiple value columns simultaneously.

However, here is an additional difficulty which justifies an answer on its own, IMHO. The column names occassionally contain surplus white space which needs to be removed beforehand in order to have a consistent naming pattern for the columns.

names(df1)
 [1] "id"                        "GFR Scans?"                "1. Date of renal scan:"    "1. Type of renal scan:"   
 [5] "1. GFR mL/1.73 sq.m"       "1. Pre/Post tx?"           "2. Date of renal scan:"    "2. Type of renal scan:"   
 [9] "2. GFR mL/1.73 sq.m"       "2. Pre/Post tx?"           "3. Date of renal scan:"    "3. Type of renal scan:"   
[13] "3. GFR mL/1.73 sq.m"       "3. Pre/Post tx?"           "4. Date of    renal scan:" "4. Type of renal scan:"   
[17] "4. GFR mL/1.73 sq.m"       "4. Pre/Post tx?"           "5. Date of    renal scan:" "5. Type of renal scan:"   
[21] "5. GFR mL/1.73 sq.m"       "5. Pre/Post  tx?"          "6. Date of renal scan:"    "6. Type of renal scan:"   
[25] "6. GFR mL/1.73 sq.m"       "6. Pre/Post tx?"           "7. Date of renal scan:"    "7. Type of renal scan:"   
[29] "7. GFR mL/1.73 sq.m"       "7. Pre/Post tx?"           "8. Date of renal scan:"    "8. Type of renal scan:"   
[33] "8. GFR mL/1.73 sq.m"       "8. Pre/Post tx?"           "9. Date of renal scan:"    "9. Type of renal scan:"   
[37] "9. GFR mL/1.73 sq.m"       "9. Pre/Post tx?"           "10. Date   of renal scan:" "10. Type of renal scan:"  
[41] "10. GFR mL/1.73 sq.m"      "10. Pre/Post tx?"          "11. Date of renal scan:"   "11. Type of  renal scan:" 
[45] "11. GFR mL/1.73 sq.m"      "11. Pre/Post tx?"          "12. Date of renal scan:"   "12. Type of renal scan:"  
[49] "12. GFR mL/1.73 sq.m"      "12. Pre/Post tx?"          "13. Date of renal scan:"   "13. Type of renal scan:"  
[53] "13. GFR mL/1.73 sq.m"      "13. Pre/Post tx?"          "14. Date of renal scan:"   "14. Type of renal scan:"  
[57] "14. GFR mL/1.73 sq.m"      "14. Pre/Post tx?"          "15. Date of renal scan:"   "15. Type of renal scan:"  
[61] "15. GFR mL/1.73 sq.m"      "15. Pre/Post tx?"          "16. Date of renal scan:"   "16. Type of  renal scan:" 
[65] "16. GFR mL/1.73 sq.m"      "16. Pre/Post tx?"          "17. Date of renal scan:"   "17. Type of renal scan:"
library(data.table)
library(magrittr)
# clean up column names: remove surplus whitespace
setDT(df1) %>% setnames(names(.) %>% stringr::str_replace_all("\\s+", " "))
# get name pattern for subsequent melt
cols <- names(df1)[3:6] %>% stringr::str_replace("1. ", "")
# reshape multiple columns from wide to long
long <- melt(df1, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
  # recreate lost POSIXct attribute
  , `Date of renal scan:` := lubridate::as_datetime(`Date of renal scan:`)][]

long
         id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
 1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
 2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
 3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
 4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
 5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 6: 1010001        Yes        2          2006-05-02                DTPA               86         Post
 7: 1010002        Yes        2          2008-06-27                DTPA              110         Post
 8: 1010005        Yes        2          2006-06-27                DTPA              148         Post
 9: 1010006        Yes        2          2009-06-26                DTPA              123         Post
10: 1010002        Yes        3          2008-08-19                DTPA               92         Post
11: 1010005        Yes        3          2007-07-10                DTPA              166         Post
12: 1010002        Yes        4          2009-05-19                DTPA               36         Post
13: 1010005        Yes        4          2008-06-17                DTPA              171         Post
14: 1010005        Yes        5          2010-11-02                DTPA              105         Post
15: 1010005        Yes        6          2011-12-06                DTPA              103         Post
16: 1010005        Yes        7          2012-12-11                DTPA               98         Post

In the call to melt(), we can set the parameter na.rm = FALSE to keep all data:

          id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
  1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
  2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
  3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
  4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
  5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 ---                                                                                                  
 98: 1010002        Yes       17                <NA>                <NA>               NA         <NA>
 99: 1010004        Yes       17                <NA>                <NA>               NA         <NA>
100: 1010005        Yes       17                <NA>                <NA>               NA         <NA>
101: 1010006        Yes       17                <NA>                <NA>               NA         <NA>
102: 1010007         No       17                <NA>                <NA>               NA         <NA>
1
votes

Here is a working solution, not the best but working. The strategy is to go from wide to long and then to tidy format.

When going from the original wide to long format, all of the columns were converted to the lowest common form which is character in this case thus requiring column conversion at the end.

In order to remove the rows with NA, I use complete.cases thus your last id 1010007 was not in the final output. If this is a problem, you should adjust the position of the NA clean up step.

library(tidyr)
library(dplyr)

#convert from wide to long
new<-gather(df,key = "key", value = "value", -id, -`GFR Scans?`)
#clean up the key column
new$key<-sub("[0-9]+\\. ", "", new$key)
new$key<-gsub("[ ]+", " ", new$key)

# verify column headings (should only be 4)
unique(new$key)
#remove the rows with NA
new<-new[complete.cases(new),]

#now go from long to slightly wide
answer<-new %>% group_by( id, `GFR Scans?`, key) %>% mutate(testnum=row_number()) %>% spread(key, value)  

#convert the colmns back to the proper type
answer$`Date of renal scan:`<-as.POSIXct(as.numeric(answer$`Date of renal scan:`), origin="1970-01-01", tz="UTC")
answer$`GFR mL/1.73 sq.m`<-as.numeric(answer$`GFR mL/1.73 sq.m`)
answer

# id `GFR Scans?` testnum `Date of renal scan:` `GFR mL/1.73 sq.m` `Pre/Post tx?` `Type of renal scan:`
#     <dbl> <chr>          <int> <dttm>                             <dbl> <chr>          <chr>                
# 1 1010001 Yes                1 2005-12-07 00:00:00                   18 Pre            DTPA                 
# 2 1010001 Yes                2 2006-05-02 00:00:00                   86 Post           DTPA                 
# 3 1010002 Yes                1 2007-12-05 00:00:00                   13 Pre            DTPA                 
# 4 1010002 Yes                2 2008-06-27 00:00:00                  110 Post           DTPA                 
# 5 1010002 Yes                3 2008-08-19 00:00:00                   92 Post           DTPA                 
# 6 1010002 Yes                4 2009-05-19 00:00:00                   36 Post           DTPA                 
# 7 1010004 Yes                1 2009-03-18 00:00:00                   68 Post           DTPA                 
# 8 1010005 Yes                1 2005-08-16 00:00:00                  117 Post           DTPA