0
votes

I am using a data.table to store data which is in a string format. The strings hold information that I want to retrieve using a function. This function, in my real script, does multiple calculations and parsing, and at the end returns another data.table with many columns and many rows. This function receives a whole row of my original data.table as argument (all variables are used): myFun(dt[rowNumber, ]

While some columns of my original data.table will still be used later in my script, one of the variables in the data.table is expendable after processing, so I want to replace this variable with the data.table I get from my function. This allows me to keep some link between my remaining variables and this new data.table, so I can later pass all together to other functions.

However, since I am working with many rows, I want to speed things up using data.table::set function to update my cell, but R won't allow me to use:

data.table::set(dt, i=rowNum, j=colNum, value = list(list(myFun(dt[rowNum, ])))

If firstly, I don't do:

dt$someVar[1L] <- list(myFun(dt[1L, ]))

This is the following warning that I get using only set

In data.table::set(dt, i = rowNum, j = colNum, value = list(list(myFun(dt[rowNum, : Coerced 'list' RHS to 'character' to match the column's type. Either change the target column to 'list' first (by creating a new 'list' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

I receive the same warning when using solely:

dt[rowNum, ((names(dt))[colNum]) := list(list(myFun(dt[rowNum, ])))]

Here is an clear illustrative example (not real problem) of this issue I am facing:

col1 <- as.character(1:3)
col2 <- as.character(4:6)
col3 <- as.character(7:9)
dt <- data.table::data.table(var1 = col1, var2 = col2, var3 = col3)

myFun <- function(rowDt)
{
    v1 <- as.numeric(rowDt$var1[1])
    v2 <- as.numeric(rowDt$var2[1])
    v3 <- as.numeric(rowDt$var3[1])
    col1 <- c(v1*v2, v1*v3) 
    col2 <- c(v2*v2, v2*v3)

    return(data.table::data.table(var1 = col1, var2 = col2))
}

colNum = 3L
for (rowNum in 1L:nrow(dt))
{
  data.table::set(dt, i=rowNum, j=colNum, value = list(list(myFun(dt[rowNum, ]))))
}

The above code yields the previous warning message, howwever, this works:

colNum = 3L
dt$var3[1L] <- list(myFun(dt[1L, ]))
for (rowNum in 2L:nrow(dt))
{
  data.table::set(dt, i=rowNum, j=colNum, value = list(list(myFun(dt[rowNum, ]))))
}

Is this an expected behavior? If it is, why does it happen and how could I take advantage of data.table::set higher performance by only using it?

1
it's not clear what you're trying to do. why not just do: dt[ , (names(dt)) := lapply(.SD, as.numeric)]; dt[ , c('new_var_1', 'new_var_2') := .(var1 + var2, var1 + var3)]?MichaelChirico
Very interesting comment, I am fairly new in R and those are some new syntax you've showed me. This is simply an illustrative example that triggers the same warning message that I receive in my real script. In my real script I need to apply a lot of parsing and complex mathematical equations, which results in a bunch of new data that I would like to save in a data.table format. Instead of generating new columns in my previous data.table, I would like to keep it 'clean' while still having its rows linked with this new data.table.Guidotti
You should be seeing an error like Item 1 of column numbers in j is 3 which is outside range [1,ncol=2]. Use column names instead in j to add new columns. The FAQ covers why it's designed to prefer column names over numbers. Ah, never mind, I see you do create the third column with dt$var3[1L] <- list(myFun(dt[1L, ])) which is very unidiomatic. You might want to take a detour to become familiar with the typical package syntax (which looks more like what Michael wrote).Frank
Yes, I agree. As I am new in R I am struggling to learn some of its syntax. However, I am not aiming to add new columns in my code, if you take a careful look I am trying to replace a column of strings with data.tables. This string data is expandable after my calculations, and my data.table returns more rows than what I currently have in my original data.table, so I cannot add new columns to this data.table with information retrieved from the strings (actually I could if I fill the new rows with NAs or zeros). In this case, what would be the "recommended" or more efficient solution?Guidotti
data.table is very strict with its column types (for reasons of efficiency and also because its safer from a user perspective). If you want to change a column's type you should always do that explicitly. However, have you considered simply creating a new column?Roland

1 Answers

0
votes

data.table doesn't like coercing an entire column to another class because one of its values changes. But it's totally fine if you replace the entire column. And that jives better with R's vectorized approach.

myFun <- function(wholeDT) {
  dt_copy <- copy(wholeDT)
  dt_copy[, ':='(
    var1 = as.numeric(var1),
    var2 = as.numeric(var2),
    var3 = as.numeric(var3)
  )][, ':='(
    col1 = mapply(FUN = c, var1 * var2, var1 * var3, SIMPLIFY = FALSE),
    col2 = mapply(FUN = c, var2 * var2, var2 * var3, SIMPLIFY = FALSE)
  )][
    ,
    var3 := apply(.SD, MARGIN = 1L, as.data.table),
    .SDcols = c('col1', 'col2')
  ]
  # Remove the intermediate columns
  set(dt_copy, j = c('col1', 'col2'), value = NULL)
  dt_copy
}


dt_new <- myFun(dt)
dt_new
#      var1 var2         var3
#   1:    1    4 <data.table>
#   2:    2    5 <data.table>
#   3:    3    6 <data.table>

dt_new$var3[[1]]
#    col1 col2
# 1:    4   16
# 2:    7   28

I just took your function, had it work column-wise instead of row-wise, and then used mapply to do the list columns.

The most esoteric part is the apply call. It runs through each row of the subdata (which only columns col1 and col2, thanks to .SDcols). Each row is passed as a list, which is perfect for as.data.table.

I don't like using apply, though, as there's usually a more efficient solution. Until somebody points it out, this should work as you want.