1
votes

Suppose I have a data frame with a few categorical variables and some columns that are string values. I would like to create a new column that, for each row, pastes string values from other rows if certain values in the categorical columns match (or don't match). Here is a toy example.

toy <- data.frame("id" = c(1,2,3,2), "year" = c(2000,2000,2004,2004), "words" = c("a b", "c d", "e b", "c d"))

I would like to create a variable word_pool that is pasted from other rows' words column if two criteria are met: the row's id value is different from the current row's id value and the row's year value is less than the current row's year value.

What should result is

id year words word_pool
1 2000    a b
2 2000    c d
3 2004    e b   a b c d
2 2004    c d       a b

The first two rows will be blank for the new column since there isn't a year less than 2000 in the toy example. The last row will only have "a b" as the value in the new column since its id is repeated.

I've tried various apply and group_by approaches but none seem to fit the bill exactly. Would appreciate any and all ideas!

2

2 Answers

1
votes

I've used sqldf and plyr packages to achieve the solution. Though I wouldn't call this an elegant solution it works. Hope to see a more efficient solution from others.

library(sqldf)

toy <- data.frame("id" = c(1,2,3,2), 
                   "year" = c(2000,2000,2004,2004), 
                   "words" = c("a b", "c d", "e b", "c d"))

toy

#  id year words
#1  1 2000   a b
#2  2 2000   c d
#3  3 2004   e b
#4  2 2004   c d

df <- sqldf('SELECT t1.*,t2.words AS word_pool FROM toy t1 LEFT JOIN toy t2 
       ON t1.year > t2.year AND
       t1.words <> t2.words')

df
#  id year words word_pool
#1  1 2000   a b      <NA>
#2  2 2000   c d      <NA>
#3  3 2004   e b       a b
#4  3 2004   e b       c d
#5  2 2004   c d       a b

result <- plyr::ddply(df,c("id","year","words"), 
                      function(dfx)paste(dfx$word_pool, 
                                         collapse = " "))

result
#  id year words      V1
#1  1 2000   a b      NA
#2  2 2000   c d      NA
#3  2 2004   c d     a b
#4  3 2004   e b a b c d
0
votes

With a for and which, this must write like apply and no use extern libraries

        ## Create data
        toy <-
          data.frame(
            "id" = c(1, 2, 3, 2),
            "year" = c(2000, 2000, 2004, 2004),
            "words" = c("a b", "c d", "e b", "c d")
          )

        toy$word_pool <- 0
        for (i in 1:length(toy)) {
          # Recognize index from condition
          condition_index <- which(toy$year[i] > toy$year
                                        & toy$id[i] != toy$id)
          # assign
          if (length(condition_index) == 0){# case no index
            toy$word_pool[i] = ""
          }
          else{# paste with collapse join array
            toy$word_pool[i] = paste(toy$words[condition_index],
                                     collapse = " ", sep = " ")
          }
        }
        toy
        # id year words word_pool
        # 1  2000   a b          
        # 2  2000   c d          
        # 3  2004   e b   a b c d
        # 2  2004   c d       a b