0
votes

I need to collapse partially duplicative rows in a data.table

Example:

  df.in <- data.table (
  tkr = c("abc", "abc", "def", "def", "ghi", "ghi"),
  lboq = c(0,296, -390, 0,  -88, 0),
  locq = c(-296,0, 0, 390, 0, 88),
  ap = c(134,134, 23, 23, 17, 17)
  )

     tkr lboq locq  ap
  1: abc    0 -296 134
  2: abc  296    0 134
  3: def -390    0  23
  4: def    0  390  23
  5: ghi  -88    0  17
  6: ghi    0   88  17

Using this example, I want to end up with three rows where the second row collapses into the first with 296 in column 2; row 4's 390 would be in column 3 of row 3, etc.

So the desired result would be:

  df.out <- data.table (
  tkr = c("abc", "def", "ghi"), 
  lboq = c(296, -390, -88),
  lbcq = c(-296, 390, 88),
  ap = c(134, 23, 17)
  )

     tkr lboq lbcq  ap
  1: abc  296 -296 134
  2: def -390  390  23
  3: ghi  -88   88  17

I have studied the following with the reaction indicated:

collapse rows in 2 different columns in data.table?

collapse rows in data.table

Despite the titles in the above two, they looks to be more of an expansion than collapsing

I also explored using coalesce in the following, but that appears to be for columns only; is there a coalesce for rows?

Coalesce two string columns with alternating missing values to one

Use dplyr coalesce in programming

How to use Coalesce function on a dataframe

I then looked at

R Summarize Collapsed Data.Table

but the collapsing already done here and needs summarizing; I need to do the collapsing

this looks to be exactly what I need

Merging complementary rows of a dataframe with R

but adapting it to my use doesn't seem to work:

    df.in %>%
      group_by(tkr, ap) %>%
      summarise_all(funs(sort(.)[1]))

What am I missing?

1

1 Answers

1
votes

Maybe this will help :

library(dplyr)

df.in %>%
  group_by(tkr) %>%
  summarise(across(lboq:ap, ~.x[.x != 0][1]))

#  tkr    lboq  locq    ap
#* <chr> <dbl> <dbl> <dbl>
#1 abc     296  -296   134
#2 def    -390   390    23
#3 ghi     -88    88    17

For each tkr this selects the 1st non-zero value in columns lboq:ap.