0
votes

I have imported a table (Table1) in the following format:

ID    Name    Age   Start    Status1    Status2    Status3    Status4    Status5   Status6
01    John     45    2011         25         13         13         12         34        35
02    Mike     23    2017         12         15         18         45         32        25
03   Peter     34    2019         34         12         15         19         24        27
04   Sarah     21    2020         45         46         47         48         49        50

And I have generated another table (Table2) that looks like this:

ID    Name    Age   Start    Year    Value
02    Mike     23    2017       3       24
03   Peter     34    2019       1       11
03   Peter     34    2019       2       13
03   Peter     34    2019       3       13
04   Sarah     21    2020       1       42
04   Sarah     21    2020       2       43
04   Sarah     21    2020       3       44

Table2 contains predicted Values for the first 3 years after and including Start, BUT only if three years have not passed yet (as at the current year, 2019). So in the above example, ID 01 is already more than 3 years passed their Start value so have no predictions; ID 02 is two years passed their Start value so has one prediction; IDs 03 and 04 are zero years passed their Start value so have three predictions.

I need to transpose the Values from Table2 into Table1, replacing the Status values and shifting them to the right. The final table should look like this:

ID    Name    Age   Start    Status1    Status2    Status3    Status4    Status5   Status6
01    John     45    2011         25         13         13         12         34        35
02    Mike     23    2017         24         12         15         18         45        32
03   Peter     34    2019         11         13         13         34         12        15
04   Sarah     21    2020         42         43         44         45         46        47

As you can hopefully see, the row for ID 01 remains unchanged. For ID 02, the old Status values have shifted one to the right, and Status1 has been replaced. For IDs 03 and 04, the old Status values have shifted three places to the right, and Status1 to Status3 have been replaced. As you can see, the old values that were in Status4 to Status6 have been removed.

I am quite stuck with this. So far I have tried to make some rules around the replacements as follows:

1) If Year is more than 3 years from the current year (2019), then leave the Status column names as they are. e.g. row 1 of Table1.

2) If Start is 2019 or greater, rename the Status columns by adding 3. e.g in rows 3 and 4 of Table1, Status1 should be renamed Status4, and so on. The last 3 Status columns (Status4 to Status6) would be deleted.

3) If Start is within 3 years of 2019 (i.e. 2017 to 2019), then rename the Status columns by adding the difference between 2016 and Start. e.g. in row 2 of Table1, Status1 would be renamed Status2 and so on. The last column (Status6) would be deleted.

I could code these using if / else, but I'm not sure how to fit these within a data wrangling pipe with select() and left_join() to carry out these changes successfully. I am using R, but could use SQL too if it was easier. Any help would be much appreciated.

p.s. I appreciate that the resulting table is in a strange format, but unfortunately that is what I need to produce.

2
Where did 11 come from for Mike's new Status1?IceCreamToucan
@IceCreamToucan Sorry, that was a typo - I edited it now to fix it.rw2

2 Answers

2
votes

Here's a data.table solution. Most of the answer comes down to the merge_row function which takes a concatenated vector of the predicted values and the status values, removes NAs, then truncates the vector at 6 elements. It returns a list because that's what data.table expects on the RHS when using :=.


library(data.table)
set.seed(1234)

## Building data

table1 <- data.table(ID = 1:10,
                     Name = LETTERS[1:10],
                     Age = rpois(10, 40),
                     Start = sample(2015:2020, 10, rep = T),
                     Status1 = rpois(10, 10),
                     Status2 = rpois(10, 10),
                     Status3 = rpois(10, 10),
                     Status4 = rpois(10, 10),
                     Status5 = rpois(10, 10),
                     Status6 = rpois(10, 10))
res = list()
for(i in 1:nrow(table1)) {
  if(table1[i,Start] <= 2016) {
    res[[i]] <- NULL
  } else {
    y <- table1[i, Start]
    n_y <- min(3, 3 - (2019-y))
    res[[i]] <- table1[i,
           cbind(.SD[,.(ID, Name, Age, Start)], 
                 Year = seq_len(n_y),
                 Value = rpois(n_y, 20))]
  } 
}
table2 <- do.call(rbind, res)
merge_row <- function(x) {
  x = x[!is.na(x)][1:6]
  return(lapply(1:6, FUN = function(i) x[i]))
}

## Doing the merge

merged_table <- merge(dcast(table2, ID + Name + Age + Start ~ Year, value.var = 'Value'), 
                      table1, 
                      all = TRUE)[, 
                                  paste0('Status', 1:6) := 
                                    merge_row(c(`1`, `2`, `3`, Status1, 
                                                 Status2, Status3, Status4, 
                                                 Status5, Status6)),
                                  ID][,c('1', '2', '3') := NULL][]
merged_table
#>     ID Name Age Start Status1 Status2 Status3 Status4 Status5 Status6
#>  1:  1    A  32  2019      23      22      18       8      15       8
#>  2:  2    B  41  2018      29      17       8       5      11      10
#>  3:  3    C  42  2017      20      11      14      15      13      10
#>  4:  4    D  35  2018      21      19       7       6      12      16
#>  5:  5    E  43  2019      19      17      19       8      14      18
#>  6:  6    F  36  2016      10       8       9      11      12       6
#>  7:  7    G  46  2019      18      19      20      10      13       7
#>  8:  8    H  36  2016      12      14       8      10      11      11
#>  9:  9    I  36  2020      23      16      19      13       6      11
#> 10: 10    J  28  2017      19      20       8       9       7       9

Created on 2019-09-24 by the reprex package (v0.3.0)

1
votes

Another option is to rbind Table2 with a long format of Table1, take the top 6 values and then pivot into wide format again:

idcols <- c("ID","Name","Age","Start")
dcast(
    rbindlist(list(Table2,
        melt(Table1, id.vars=idcols, variable.name="Year", value.name="Value")))[,
            .(Status=paste0("Status", 1:6), Value=head(Value, 6)), idcols],
    as.formula(paste(paste(idcols, collapse=" + "), "~ Status")), value.var="Value")

output:

   ID  Name Age Start Status1 Status2 Status3 Status4 Status5 Status6
1:  1  John  45  2011      25      13      13      12      34      35
2:  2  Mike  23  2017      24      12      15      18      45      32
3:  3 Peter  34  2019      11      13      13      34      12      15
4:  4 Sarah  21  2020      42      43      44      45      46      47

data:

library(data.table)
Table1 <- fread("ID    Name    Age   Start    Status1    Status2    Status3    Status4    Status5   Status6
01    John     45    2011         25         13         13         12         34        35
02    Mike     23    2017         12         15         18         45         32        25
03   Peter     34    2019         34         12         15         19         24        27
04   Sarah     21    2020         45         46         47         48         49        50")
Table2 <- fread("ID    Name    Age   Start    Year    Value
02    Mike     23    2017       3       24
03   Peter     34    2019       1       11
03   Peter     34    2019       2       13
03   Peter     34    2019       3       13
04   Sarah     21    2020       1       42
04   Sarah     21    2020       2       43
04   Sarah     21    2020       3       44")