20
votes

I looking for a way to efficiently apply a function to each row of data.table. Let's consider the following data table:

library(data.table)
library(stringr)

x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

Let's say I want to split each element of column b by space (thus yielding two rows for each row in the original data) and join the resulting data tables. For the example above, I need the following result:

   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

The following would work if column a has only unique values:

x[, list(str_split(b, ' ')[[1]]), by = a]

The following almost works (unless there are some identical rows in the original data table), but is ugly when x has many columns and copies column b to the result, which I would like to avoid.

>     x[, list(str_split(b, ' ')[[1]]), by = list(a,b)]
   a     b V1
1: 1 12 13 12
2: 1 12 13 13
3: 2 14 15 14
4: 2 14 15 15
5: 3 16 17 16
6: 3 16 17 17
7: 1 18 19 18
8: 1 18 19 19

What would be the most efficient and idiomatic way to solve this problem?

7

7 Answers

14
votes

How about :

x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

Generalized solution given comment :

x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]
4
votes
x[, .(a,strsplit(b,' ')), by=1:nrow(x)]

by=nrow(x) is a simple way to force 1 row per by-group

4
votes
x[, .(a,strsplit(b,' ')), by = .I]

looks more estetic

2
votes

One option would be to add a row number

x[, r := 1:nrow(x)]

and then group by r:

x[, list(a, str_split(b, ' ')[[1]]), by = r]

I'm wondering if there are better solutions?

2
votes

The most effective and idiomatic approach is to have a vectorized function.

In this case, some kind of regex will do what you want

 x[, V1 := gsub(" [[:alnum:]]*", "", b)]

   a     b V1
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

If you want to return the each split component, and you know there are two in each one, you can use Map to coerce the result of strsplit into the correct form

x[, c('b1','b2')  := do.call(Map, c(f = c, strsplit(b, ' ')))]



x
   a     b b1 b2
1: 1 12 13 12 13
2: 2 14 15 14 15
3: 3 16 17 16 17
4: 1 18 19 18 19
1
votes

The dplyr/tidyr approach also works with data tables.

library(dplyr)
library(tidyr)
x %>% 
  separate(b, into = c("b1", "b2")) %>% 
  gather(b, "V1", b1:b2) %>%
  arrange(V1) %>%
  select(a, V1)

Or, using the standard evaluation forms:

x %>% 
  separate_("b", into = c("b1", "b2")) %>% 
  gather_("b", "V1", c("b1", "b2")) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)

The case of different numbers of values in the b column is only slightly more complicated.

library(stringr)

x2 <- data.table(
  a = c(1:3, 1), 
  b = c('12 13', '14', '15 16 17', '18 19')
)

n <- max(str_count(x2$b, " ")) + 1
b_cols <- paste0("b", seq_len(n))
x2 %>% 
  separate_("b", into = b_cols, extra = "drop") %>% 
  gather_("b", "V1", b_cols) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)
0
votes

Looking at input and desired output, this should work -

x <- data.frame(a=c(1,2,3,1),b=c("12 13","14 15","16 17","18 19"))
data.frame(a=rep(x$a,each=2), new_b=unlist(strsplit(as.character(x$b)," ")))