Replace duplicated values in consecutive runs with blank

Question

First, some data:

library(data.table)

# 1. Input table
df_input <- data.table(
  x = c("x1", "x1", "x1", "x2", "x2"),
  y = c("y1", "y1", "y2", "y1", "y1"),
  z = c(1:5))

In each column, I want to keep only the first value in each run of consecutive values. E.g. look at the y column, which has three different runs: (1) two y1, (2) one y2, and (3) a second run of y1. Within each such run, duplicated values should be replaced with "".

#     x  y z
# 1: x1 y1 1   # 1st value in run of y1: keep
# 2: x1 y1 2   # 2nd value in run: replace
# 3: x1 y2 3   # 1st value in run: keep
# 4: x2 y1 4   # 1st value in 2nd run of y1: keep
# 5: x2 y1 5   # 2nd value: replace

Thus, the desired output table:

df_output <- data.table(
  x = c("x1", "", "",  "x2", ""),
  y = c("y1", "", "y2", "y1", ""),
  z = c(1:5))

#     x  y z
# 1: x1 y1 1
# 2:       2
# 3:    y2 3
# 4: x2 y1 4
# 5:       5

How can I get the output table using the dplyr or data.table packages?

Thanks

akrun · Accepted Answer · 2020-05-22T20:55:56

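We can use set() from data.table in a loop over the columns. rleid() gives each run of consecutive equal values its own id (for y: 1 1 2 3 3), so duplicated() on those ids flags every value after the first in a run, and which() turns the flags into the row indices to blank out.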

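library(data.table)
# For each column, blank out every value that repeats the one directly above it.
# Note: set() updates df_input by reference, so no reassignment is needed.
for (j in names(df_input))
  set(df_input, i = which(duplicated(rleid(df_input[[j]]))), j = j, value = '')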

df_input
#    x  y z
#1: x1 y1 1
#2:       2
#3:    y2 3
#4: x2 y1 4
#5:       5
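
Since the question also asks about dplyr, here is a rough equivalent. It is a sketch, not part of the answer above. It assumes dplyr >= 1.1.0 (for consecutive_id(), which plays the role of rleid()), rebuilds the input as a plain data.frame so the example is self-contained, and only touches character columns. That last restriction is my own assumption, made so that the integer column z is never assigned "".

```r
library(dplyr)

# Same data as in the question, as a plain data.frame (assumption: any
# data.frame-like input works the same way).
df_input <- data.frame(
  x = c("x1", "x1", "x1", "x2", "x2"),
  y = c("y1", "y1", "y2", "y1", "y1"),
  z = 1:5,
  stringsAsFactors = FALSE)

# consecutive_id() numbers the runs of equal consecutive values;
# duplicated() on those ids is TRUE for every value after the first in a run.
df_output <- df_input %>%
  mutate(across(where(is.character),
                ~ replace(.x, duplicated(consecutive_id(.x)), "")))

df_output
```

Unlike the set() loop, this returns a new table and leaves df_input unchanged.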
