1
votes

I have a data table that looks like this:

ID    Start_Year    Payment_2010    Payment_2011    Payment_2012    Payment_2013
 1          2010              19              15              NA              11
 2          2012              22              24              26              23
 3          2012              NA              NA              NA              05   
 4          2011              17              NA              14              NA       

It carries on like this for many rows, with about 20 years of Payment columns in total.

I need to re-shape it, using R or SQL, with the current Payment columns replaced by ones in which the first column (Payment_1) represents the value in the Start_Year column, and the columns go on sequentially from there. For example, I would like to get the table above to look like this:

ID    Start_Year    Payment_1    Payment_2    Payment_3    Payment_4
 1          2010           19           15           NA           11
 2          2012           26           23           NA           NA
 3          2012           NA           05           NA           NA   
 4          2011           NA           14           NA           NA    

So, any Payment info in the first table that falls before the start year can be discarded. Where payment info is unavailable, the value should be NULL or NA (not zero, as these are different to NA in my data).

I started by trying to use dynamic SQL to create new columns based on the Start_year column, but got stuck. I've got closer by using spread and gather commands from the dplyr package in R, but my pipe keeps going wrong and I can't find a way to discard the Payment info that comes before the start_year value.

Any help much appreciated, many thanks in advance.

(I understand this is a strange format to want to have the data in, but that is what I need to do).

3

3 Answers

6
votes
library(tidyverse)
library(data.table) # or reshape2

df %>% 
  melt(c('ID', 'Start_Year')) %>% 
  mutate(variable = as.numeric(gsub('Payment_', '', variable)) - Start_Year + 1) %>% 
  filter(variable >= 1) %>% 
  dcast(ID + Start_Year ~ variable) %>% 
  rename_at(vars(-ID, -Start_Year), ~paste0("Payment_", .)) #credit @avid_useR

#   ID Start_Year Payment_1 Payment_2 Payment_3 Payment_4
# 1  1       2010        19        15        NA        11
# 2  2       2012        26        23        NA        NA
# 3  3       2012        NA         5        NA        NA
# 4  4       2011        NA        14        NA        NA

Data used:

structure(list(ID = 1:4, Start_Year = c(2010L, 2012L, 2012L, 
2011L), Payment_2010 = c(19L, 22L, NA, 17L), Payment_2011 = c(15L, 
24L, NA, NA), Payment_2012 = c(NA, 26L, NA, 14L), Payment_2013 = c(11L, 
23L, 5L, NA)), .Names = c("ID", "Start_Year", "Payment_2010", 
"Payment_2011", "Payment_2012", "Payment_2013"), row.names = c(NA, 
-4L), class = "data.frame")
3
votes

Similar idea as @Ryan's answer, but with only tidyervse:

library(tidyverse)

df %>%
  gather(Payment, value, -c(ID:Start_Year)) %>%
  filter(sub("Payment_", "", Payment) >= Start_Year) %>%
  group_by(ID) %>%
  mutate(Payment = seq(n())) %>%
  spread(Payment, value, sep = "_")

Result:

# A tibble: 4 x 6
# Groups:   ID [4]
     ID Start_Year Payment_1 Payment_2 Payment_3 Payment_4
  <int>      <int>     <int>     <int>     <int>     <int>
1     1       2010        19        15        NA        11
2     2       2012        26        23        NA        NA
3     3       2012        NA         5        NA        NA
4     4       2011        NA        14        NA        NA

Data:

df < -structure(list(ID = 1:4, Start_Year = c(2010L, 2012L, 2012L, 
2011L), Payment_2010 = c(19L, 22L, NA, 17L), Payment_2011 = c(15L, 
24L, NA, NA), Payment_2012 = c(NA, 26L, NA, 14L), Payment_2013 = c(11L, 
23L, 5L, NA)), .Names = c("ID", "Start_Year", "Payment_2010", 
"Payment_2011", "Payment_2012", "Payment_2013"), class = "data.frame", row.names = c(NA, 
-4L))
0
votes

Using data.table::shift to remove earlier years:

library(data.table)
dt <- fread("ID    Start_Year    Payment_2010    Payment_2011    Payment_2012    Payment_2013
 1          2010              19              15              NA              11
 2          2012              22              24              26              23
 3          2012              NA              NA              NA              05   
 4          2011              17              NA              14              NA")

dtans <- dcast(
    melt(dt, id.vars=c("ID", "Start_Year"))[, 
        value := shift(value, grep(Start_Year[1L], variable) - 1L, type="lead"), by=.(ID)],
    ID + Start_Year ~ variable)

#label columns according to OP
cols <- grep("^Payment", names(dtans), value=TRUE)
setnames(dtans, cols, paste0("Payment_", seq_len(length(cols)))) 
dtans

output:

   ID Start_Year Payment_1 Payment_2 Payment_3 Payment_4
1:  1       2010        19        15        NA        11
2:  2       2012        26        23        NA        NA
3:  3       2012        NA         5        NA        NA
4:  4       2011        NA        14        NA        NA

Or using base R without converting data into long format:

mat <- read.table(text="ID    Start_Year    Payment_2010    Payment_2011    Payment_2012    Payment_2013
 1          2010              19              15              NA              11
 2          2012              22              24              26              23
 3          2012              NA              NA              NA              05   
 4          2011              17              NA              14              NA", header=TRUE)
ans <- t(apply(mat, 1, function(x) {
    n <- grep(x["Start_Year"], names(mat)) - 1L
    c(x[1:2], x[-seq_len(n)], rep(NA_integer_, n - 2L))
}))

#label rows and columns according to OP
rownames(ans) <- rownames(mat)
colnames(ans) <- c(colnames(mat)[1:2], paste0("Payment_", seq_len(ncol(mat)-2)))
ans

output:

  ID Start_Year Payment_1 Payment_2 Payment_3 Payment_4
1  1       2010        19        15        NA        11
2  2       2012        26        23        NA        NA
3  3       2012        NA         5        NA        NA
4  4       2011        NA        14        NA        NA