Given an arbitrary list of column names in a data.table
, I want to concatenate the contents of those columns into a single string stored in a new column. The columns I need to concatenate are not always the same, so I need to generate the expression to do so on the fly.
I have a sneaking suspicion that the way I'm using the eval(parse(...))
call could be replaced with something a bit more elegant, but the method below is the fastest I've been able to get so far.
With 10 million rows, this takes about 21.7 seconds on this sample data (base R paste0
takes slightly longer -- 23.6 seconds). My actual data has 18-20 columns being concatenated and up to 100 million rows, so the slowdown becomes a little more impractical.
Any ideas to get this sped up?
Current methods
library(data.table)
library(stringi)
RowCount <- 1e7
DT <- data.table(x = "foo",
                 y = "bar",
                 a = sample.int(9, RowCount, TRUE),
                 b = sample.int(9, RowCount, TRUE),
                 c = sample.int(9, RowCount, TRUE),
                 d = sample.int(9, RowCount, TRUE),
                 e = sample.int(9, RowCount, TRUE),
                 f = sample.int(9, RowCount, TRUE))
## Generate an expression to paste an arbitrary list of columns together
ConcatCols <- c("x","a","b","c","d","e","f","y")
PasteStatement <- stri_c('stri_c(',stri_c(ConcatCols,collapse = ","),')')
print(PasteStatement)
gives
[1] "stri_c(x,a,b,c,d,e,f,y)"
which is then used to concatenate the columns with the following expression:
DT[,State := eval(parse(text = PasteStatement))]
Sample of output:
x y a b c d e f State
1: foo bar 4 8 3 6 9 2 foo483692bar
2: foo bar 8 4 8 7 8 4 foo848784bar
3: foo bar 2 6 2 4 3 5 foo262435bar
4: foo bar 2 4 2 4 9 9 foo242499bar
5: foo bar 5 9 8 7 2 7 foo598727bar
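As a side note on the eval(parse(...)) part specifically: the same assignment can be written without building an expression string at all, by splicing the selected columns into stri_c with do.call. This is just a sketch using the DT and ConcatCols defined above (not timed against the eval(parse(...)) version):

## Sketch: do.call() passes the .SD columns straight to stri_c(),
## so no expression string or eval(parse(...)) is needed
DT[, State_docall := do.call(stri_c, .SD), .SDcols = ConcatCols]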
Profiling Results
Update 1: fread, fwrite, and sed
Following @Gregor's suggestion, I tried using sed
to do the concatenation on disk. Thanks to data.table's blazing fast fread
and fwrite
functions, I was able to write the columns out to disk, eliminate the comma delimiters using sed, and then read the post-processed output back in, all in about 18.3 seconds -- not quite fast enough to make the switch, but an interesting tangent nonetheless!
ConcatCols <- c("x","a","b","c","d","e","f","y")
fwrite(DT[,..ConcatCols],"/home/xxx/DT.csv")
system("sed 's/,//g' /home/xxx/DT.csv > /home/xxx/DT_Post.csv ")
Post <- fread("/home/xxx/DT_Post.csv")
DT[,State := Post[[1]]]
Breakdown of the 18.3 overall seconds (unable to use profvis since sed is invisible to the R profiler):
data.table::fwrite() - 0.5 seconds
sed - 14.8 seconds
data.table::fread() - 3.0 seconds
:= - 0.0 seconds
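Since sed accounts for nearly all of that time, one variant that might shave some off (untested on my end, just a sketch) is tr, which only has to delete a single character rather than run a substitution:

## Possible variation: character deletion with tr instead of a sed substitution
system("tr -d ',' < /home/xxx/DT.csv > /home/xxx/DT_Post.csv")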
If nothing else, this is a testament to the extensive work of the data.table authors on performance optimizations for disk IO. (I'm using the 1.10.5 development version, which adds multi-threading to fread
; fwrite
has been multi-threaded for some time.)
One caveat: if there is a workaround to write the file using fwrite
and a blank separator as suggested by @Gregor in another comment below, then this method could plausibly be cut down to ~3.5 seconds!
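For reference, the call that workaround would need is the line below; it errors on current fwrite versions, so it's commented out and shown purely as an illustration:

## Illustration only: current fwrite versions reject an empty separator
# fwrite(DT[, ..ConcatCols], "/home/xxx/DT.csv", sep = "")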
Update on this tangent: I forked data.table and commented out the line requiring a separator of length greater than zero, but mysteriously got some spaces in the output instead. After causing a handful of segfaults trying to mess around with the C
internals, I put this one on ice for the time being. The ideal solution would not require writing to disk at all and would keep everything in memory.
Update 2: sprintf for Integer-Specific Cases
A second update here: While I included strings in my original usage example, my actual use case exclusively concatenates integer values (which can always be assumed non-null based on upstream cleaning steps).
Since this use case is highly specific and differs from the original question, I won't directly compare timings to those previously posted. However, one takeaway is that while stringi
nicely handles many character encodings and mixed vector types without needing them specified, and does a bunch of error handling out of the box, that convenience does add some time (which is probably worth it for most cases).
By using base R's sprintf
function and letting it know up front that all of the inputs will be integers, we can shave off about 30% of the run-time for 5 million rows with 18 integer columns to be concatenated (20.3 seconds instead of 28.9).
library(data.table)
library(stringi)
RowCount <- 5e6
DT <- data.table(x = "foo",
                 y = "bar",
                 a = sample.int(9, RowCount, TRUE),
                 b = sample.int(9, RowCount, TRUE),
                 c = sample.int(9, RowCount, TRUE),
                 d = sample.int(9, RowCount, TRUE),
                 e = sample.int(9, RowCount, TRUE),
                 f = sample.int(9, RowCount, TRUE))
## Generate an expression to paste an arbitrary list of columns together
ConcatCols <- list("a","b","c","d","e","f")
## Do it 3x as many times
ConcatCols <- c(ConcatCols,ConcatCols,ConcatCols)
## Using stringi::stri_c ---------------------------------------------------
stri_joinStatement <- stri_c('stri_join(',stri_c(ConcatCols,collapse = ","),', sep="", collapse=NULL, ignore_null=TRUE)')
DT[, State := eval(parse(text = stri_joinStatement))]
## Using sprintf -----------------------------------------------------------
sprintfStatement <- stri_c("sprintf('",stri_flatten(rep("%i",length(ConcatCols))),"', ",stri_c(ConcatCols,collapse = ","),")")
DT[,State_sprintf_i := eval(parse(text = sprintfStatement))]
The generated statements are as follows:
> cat(stri_joinStatement)
stri_join(a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f, sep="", collapse=NULL, ignore_null=TRUE)
> cat(sprintfStatement)
sprintf('%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i%i', a,b,c,d,e,f,a,b,c,d,e,f,a,b,c,d,e,f)
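As with the stri_c version earlier, the sprintf call can also be built without eval(parse(...)) by splicing the columns in with do.call. A sketch only, using the DT and ConcatCols above and assuming every column holds integers (not timed against the versions above):

fmt <- stri_flatten(rep("%i", length(ConcatCols)))
## unname() so the column named "f" can't partially match sprintf's `fmt` argument
DT[, State_sprintf_docall := do.call(sprintf, c(fmt, unname(mget(unlist(ConcatCols)))))]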
Update 3: R does not have to be slow.
Based on the answer by @Martin Modrák, I put together a one-trick-pony package using some data.table
internals, specialized for the "single-digit integer" case: fastConcat
. (Don't look for it on CRAN any time soon, but you can use it at your own risk by installing from the GitHub repo, msummersgill/fastConcat.)
This could probably be improved much further by someone who understands C
better, but for now, it runs the same case as in Update 2 in 2.5 seconds -- around 8x faster than sprintf()
and 11.5x faster than the stringi::stri_c()
method I was using originally.
To me, this highlights the huge opportunity for performance improvements on some of the simplest operations in R
, like rudimentary string-vector concatenation, with better-tuned C
. I guess people like @Matt Dowle have seen this for years -- if only he had the time to rewrite all of R
, not just the data.frame.
Comments
All stri_c does is immediately call a C++ function to concatenate the strings. I don't think you'll be able to beat its performance in R. Even paste goes very quickly to compiled code, hence its performance being almost as good. – Gregor Thomas
fread won't allow a blank separator, but readr::write_delim will. It's probably too slow, but worth a try. (c) sed is probably the fastest you can do from the command line, but the answers to this question suggest that you can get some speed-up with different syntax, and especially if you copy the file instead of editing it in place. – Gregor Thomas
fwrite keeps you from specifying "" as the separator. You could try using fixInNamespace to remove that line and see if it will then allow you to fwrite with sep = "". I've never used fixInNamespace before, but that should be do-able. The open question is whether there are deeper reasons for sep not to be an empty string. – Gregor Thomas
sep = "" imo. – eddi