4
votes

I have the following data and I want to add a new row before each group of IDs, preserving the ID and setting A and B to 1.00.

       ID      DATEE       A      B 
   102984 2016-11-23      2.0    2.0
   140349 2016-11-23      1.5    1.5
   167109 2017-04-16      2.0    2.0
   167109 2017-06-21      1.5    1.5

The end result:

       ID      DATEE       A      B
   102984         NA      1.0    1.0
   102984 2016-11-23      2.0    2.0
   140349         NA      1.0    1.0
   140349 2016-11-23      1.5    1.5
   167109         NA      1.0    1.0
   167109 2017-04-16      2.0    2.0
   167109 2017-06-21      1.5    1.5

So far I have used the following code, which adds an empty row at the bottom of each group: do.call(rbind, by(df,df$ID,rbind,"")). However, I couldn't insert the specific values into their respective columns when I replaced "" with a vector of values.

6
Related: stackoverflow.com/q/27730389, and maybe consider eddi's comment here: stackoverflow.com/questions/16652533/… - Frank

6 Answers

7
votes

Here is one option with the tidyverse. We take the distinct rows of the dataset by 'ID', set the variables 'A' and 'B' to 1 and 'DATEE' to NA, then row-bind the result with the original dataset using bind_rows and arrange by 'ID'.

library(tidyverse)
df1 %>%
  distinct(ID, .keep_all= TRUE) %>%
  mutate(A = 1, B = 1) %>%
  mutate(DATEE = NA) %>%
  bind_rows(., df1) %>%
  arrange(ID)
#     ID      DATEE   A   B
#1 102984       <NA> 1.0 1.0
#2 102984 2016-11-23 2.0 2.0
#3 140349       <NA> 1.0 1.0
#4 140349 2016-11-23 1.5 1.5
#5 167109       <NA> 1.0 1.0
#6 167109 2017-04-16 2.0 2.0
#7 167109 2017-06-21 1.5 1.5

(I'll assume the date formatting has been fixed, e.g., with df1$DATEE = as.Date(df1$DATEE).)
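For anyone following along, here is a minimal sketch of that setup. It assumes the question's data is read with read.table; the name df1 matches the code in this answer, and stringsAsFactors = FALSE is my own choice so that DATEE starts out as character.

```r
# Rebuild the question's data as df1 (column names and values taken from the question)
df1 <- read.table(text = "
     ID      DATEE   A   B
 102984 2016-11-23 2.0 2.0
 140349 2016-11-23 1.5 1.5
 167109 2017-04-16 2.0 2.0
 167109 2017-06-21 1.5 1.5", header = TRUE, stringsAsFactors = FALSE)

# Convert the date column from character to Date
df1$DATEE <- as.Date(df1$DATEE)

str(df1)
```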


Or translated to base R:

new1 = data.frame(ID = unique(df1$ID), DATEE = Sys.Date()[NA_integer_], A = 1, B = 1)
tabs = list(new1, df1)
res  = do.call(rbind, tabs)
res <- res[order(res$ID), ]

#       ID      DATEE   A   B
# 1 102984       <NA> 1.0 1.0
# 4 102984 2016-11-23 2.0 2.0
# 2 140349       <NA> 1.0 1.0
# 5 140349 2016-11-23 1.5 1.5
# 3 167109       <NA> 1.0 1.0
# 6 167109 2017-04-16 2.0 2.0
# 7 167109 2017-06-21 1.5 1.5
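The new rows land ahead of the original ones because order() leaves tied values in their existing relative order, and new1 was bound first. A small sketch of that behaviour, using a hand-written vector of IDs (the src labels are only there for illustration):

```r
# IDs as they appear after rbind(new1, df1): three new rows, then four original rows
id  <- c(102984, 140349, 167109, 102984, 140349, 167109, 167109)
src <- c("new", "new", "new", "old", "old", "old", "old")

# Tied IDs keep their original relative order, so "new" precedes "old" within each ID
idx <- order(id)
idx        # 1 4 2 5 3 6 7
src[idx]   # "new" "old" "new" "old" "new" "old" "old"
```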

Or with data.table:

library(data.table)
new1 = data.table(ID = unique(df1$ID), DATEE = Sys.Date()[NA_integer_], A = 1, B = 1)
tabs = list(new1, df1)
res  = rbindlist(tabs)
setorder(res)

#       ID      DATEE   A   B
#1: 102984       <NA> 1.0 1.0
#2: 102984 2016-11-23 2.0 2.0
#3: 140349       <NA> 1.0 1.0
#4: 140349 2016-11-23 1.5 1.5
#5: 167109       <NA> 1.0 1.0
#6: 167109 2017-04-16 2.0 2.0
#7: 167109 2017-06-21 1.5 1.5

There are some other ways, too:

# or let DATEE and other cols be filled as NA
library(data.table)
new1 = data.table(ID = unique(df1$ID), A = 1, B = 1)
tabs = list(df1, new1)
res  = rbindlist(tabs, fill = TRUE, idcol = "src")
setorder(res, ID, -src)
res[, src := NULL ]

# or a more compact option (assuming df1$A has no missing values)
library(data.table)
setDT(df1)[, .SD[c(.N+1, seq_len(.N))], ID][is.na(A), c("A", "B") := 1][]
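The compact option relies on the fact that asking for a row index one past the end returns a row of NAs. Here is a base R sketch of the same idea on a plain data frame for a single group; the toy data frame g is made up for illustration, and data.table's .SD[.N + 1] plays the corresponding role inside each group.

```r
# One group's worth of rows
g <- data.frame(DATEE = c("2017-04-16", "2017-06-21"), A = c(2, 1.5), B = c(2, 1.5))
n <- nrow(g)

# Index n + 1 does not exist, so it comes back as an all-NA row, placed first
out <- g[c(n + 1, seq_len(n)), ]

# Fill the placeholder row, identified by its missing A, with the wanted values
out[is.na(out$A), c("A", "B")] <- 1
out
```

This is also why the one-liner assumes A has no missing values: a genuine NA in A would be overwritten as well.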
4
votes

Here are two solutions with base R

1

Split into sub-groups based on ID, add a row to the top of each sub-group, and rbind everything back at the end.

do.call(rbind, lapply(split(df, df$ID), function(a){
    rbind(setNames(c(a$ID[1], NA, 1, 1), names(a)), a)
}))
#             ID      DATEE   A   B
#102984.1 102984       <NA> 1.0 1.0
#102984.2 102984 2016-11-23 2.0 2.0
#140349.1 140349       <NA> 1.0 1.0
#140349.2 140349 2016-11-23 1.5 1.5
#167109.1 167109       <NA> 1.0 1.0
#167109.3 167109 2017-04-16 2.0 2.0
#167109.4 167109 2017-06-21 1.5 1.5

2

Or you could initially replicate the first rows (by identifying them with ave) and then substitute appropriate values in each column.

df = df[sort(c(1:NROW(df), which(ave(df$A, df$ID, FUN = seq_along) == 1))),]
df$DATEE = replace(df$DATEE, which(ave(df$A, df$ID, FUN = seq_along) == 1), NA)
df$A = replace(df$A, which(ave(df$A, df$ID, FUN = seq_along) == 1), 1)
df$B = replace(df$B, which(ave(df$A, df$ID, FUN = seq_along) == 1), 1)
df
#        ID      DATEE   A   B
#1   102984       <NA> 1.0 1.0
#1.1 102984 2016-11-23 2.0 2.0
#2   140349       <NA> 1.0 1.0
#2.1 140349 2016-11-23 1.5 1.5
#3   167109       <NA> 1.0 1.0
#3.1 167109 2017-04-16 2.0 2.0
#4   167109 2017-06-21 1.5 1.5
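The ave(..., FUN = seq_along) == 1 test used above numbers the rows within each ID group and flags the first one. A minimal sketch with the IDs from the question:

```r
id <- c(102984, 140349, 167109, 167109)

# Running row number within each group of identical IDs
pos <- ave(seq_along(id), id, FUN = seq_along)
pos                 # 1 1 1 2

# Positions of the first row of every group
first <- which(pos == 1)
first               # 1 2 3
```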
4
votes

Another idea using purrr. First, we split() the data by ID, then we use imap (indexed map) with dfr (return data frames created by row-binding) to loop over each group and add_row() with the specified values.

library(tidyverse)

df %>%
  split(.$ID) %>%
  # We don't have to specify "DATEE", absent variables get missing values.
  # The names produced by split() are character, so convert .y to match the ID column.
  imap_dfr(~ add_row(.x, ID = as.integer(.y), A = 1, B = 1, .before = 1))

Which gives:

#      ID      DATEE   A   B
#1 102984       <NA> 1.0 1.0
#2 102984 2016-11-23 2.0 2.0
#3 140349       <NA> 1.0 1.0
#4 140349 2016-11-23 1.5 1.5
#5 167109       <NA> 1.0 1.0
#6 167109 2017-04-16 2.0 2.0
#7 167109 2017-06-21 1.5 1.5

From the documentation:

imap_xxx(x, ...), an indexed map, is short hand for map2(x, names(x), ...) if x has names, or map2(x, seq_along(x), ...) if it does not. This is useful if you need to compute on both the value and the position of an element.
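To make the quoted equivalence concrete without loading purrr, here is a rough base R analogue that pairs each element of a named list with its name via Map(). The list x and the pasted labels are invented for illustration; Map() only stands in for map2() here and says nothing about how purrr is implemented.

```r
# A named list, shaped like the result of split(df, df$ID)
x <- list("102984" = 1, "140349" = 2)

# Pair every value with its name, as an indexed map does for named input
res <- Map(function(value, name) paste(name, value, sep = ": "), x, names(x))
res[["102984"]]   # "102984: 1"
```

Note that the name always arrives as a character string, even when it looks like a number.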

3
votes

Flag the first row of each ID with the logical vector u, and repeat those rows once more, giving DF2. Then flag the first row of each ID in DF2, uu, and insert NA, 1, 1 into every column of those rows except the first. No packages are used.

u <- !duplicated(DF$ID)
DF2 <- DF[rep(1:nrow(DF), 1 + u), ]
uu <- !duplicated(DF2$ID)
DF2[uu, -1] <- list(NA, 1, 1)

giving:

> DF2
        ID      DATEE   A   B
1   102984       <NA> 1.0 1.0
1.1 102984 2016-11-23 2.0 2.0
2   140349       <NA> 1.0 1.0
2.1 140349 2016-11-23 1.5 1.5
3   167109       <NA> 1.0 1.0
3.1 167109 2017-04-16 2.0 2.0
4   167109 2017-06-21 1.5 1.5
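The row duplication hinges on rep() accepting a vector of counts: 1 + u is 2 for the first row of each ID and 1 otherwise. A small sketch using the four IDs from the question:

```r
id <- c(102984, 140349, 167109, 167109)

u <- !duplicated(id)   # TRUE for the first occurrence of each ID
times <- 1 + u         # 2 2 2 1

# Each row index is repeated as many times as its count says
idx <- rep(seq_along(id), times)
idx                    # 1 1 2 2 3 3 4
```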

Note: The input in reproducible form is:

Lines <- "
     ID      DATEE       A      B 
   102984 2016-11-23      2.0    2.0
   140349 2016-11-23      1.5    1.5
   167109 2017-04-16      2.0    2.0
   167109 2017-06-21      1.5    1.5"
DF <- read.table(text = Lines, header = TRUE)

Updates: Have corrected output (code was correct but output did not correspond) and also simplified code.

2
votes

Joining the party, here is yet another base R solution. We replicate row names in order to expand our data frame, and then simply replace the values.

d1 <- df[rep(rownames(df), (!duplicated(df$ID)) + 1),]
d1$DATEE <- replace(d1$DATEE, !duplicated(d1$ID), NA)
d1[-c(1:2)] <- lapply(d1[-c(1:2)], function(i) replace(i, is.na(d1$DATEE), 1))

Which gives,

       ID      DATEE   A   B
1   102984       <NA> 1.0 1.0
1.1 102984 2016-11-23 2.0 2.0
2   140349       <NA> 1.0 1.0
2.1 140349 2016-11-23 1.5 1.5
3   167109       <NA> 1.0 1.0
3.1 167109 2017-04-16 2.0 2.0
4   167109 2017-06-21 1.5 1.5
2
votes

We can also use the by function you wanted to use, or even the tapply function, in base R. For tapply, make sure to put the INDICES in a list, since this is a data frame. For by, it is not necessary to put them in a list. So in the code below, we can replace by(A,A$ID... with tapply(A,list(A$ID)... and both will give the same results.

`rownames<-`(do.call(rbind,by(A,A$ID,
                  function(i) rbind(data.frame(ID=i$ID[1],DATEE=NA,A=1,B=1),i))),NULL)
      ID      DATEE   A   B
1 102984       <NA> 1.0 1.0
2 102984 2016-11-23 2.0 2.0
3 140349       <NA> 1.0 1.0
4 140349 2016-11-23 1.5 1.5
5 167109       <NA> 1.0 1.0
6 167109 2017-04-16 2.0 2.0
7 167109 2017-06-21 1.5 1.5

No separate sorting step is needed here; sorting could distort the order the data was in previously.
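One caveat worth checking on your own data: by() processes the groups in the sorted order of the grouping values, so rows keep their relative order within each ID, but the groups themselves come out ordered by ID. A minimal sketch, using a made-up data frame whose IDs are deliberately not sorted:

```r
# Hypothetical input with the IDs out of order
A <- data.frame(ID = c(3, 1, 3),
                DATEE = c("2017-04-16", "2016-11-23", "2017-06-21"),
                A = c(2, 2, 1.5), B = c(2, 2, 1.5))

# Same approach as above: prepend one row per group, then bind the groups
res <- do.call(rbind, by(A, A$ID, function(i)
  rbind(data.frame(ID = i$ID[1], DATEE = NA, A = 1, B = 1), i)))
rownames(res) <- NULL

res$ID   # 1 1 3 3 3
```

For input that is already grouped in ascending ID order, as in the question, this makes no difference.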