2
votes

I have the following R data.table dt, which is composed of several numeric columns with two columns of character strings.

dt = data.table(
      numericvals = rep(25, 8),
      numeric = rep(42, 8),
      first = c("beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property"),
      second = rep(c("abcde"), 8)
  )

print(dt)
   numericvals numeric                   first second
1:          25      42        beneficiary, duke abcde
2:          25      42                  compose abcde
3:          25      42             herd primary abcde
4:          25      42                    stall abcde
5:          25      42                     deep abcde
6:          25      42 regular summary classify abcde
7:          25      42                   timber abcde
8:          25      42                 property abcde

The column first contains one or more strings. If there are more that one, these are separated by either a space or comma.

My goal is to create a column recording the length of strings in first which are longer or shorter in length (via nchar()) than the strings in second. If these are the same size, this case should be ignored.

If the columns were composed of only one string per row, this analysis would be easy for me. I would create a new column called longer and keep track of the string length in first if it is longer, i.e.

dt[, longer:=ifelse(nchar(first) > nchar(second), nchar(first), 0)]

and similar for shorter:

dt[, shorter:=ifelse(nchar(first) < nchar(second), nchar(first), 0)]

I don't know how to deal with multiple strings in first, especially if there are 3.

Here is how the analysis should look like:

   numericvals numeric                   first second  longer  shorter
1:          25      42        beneficiary, duke abcde  11       4
2:          25      42                  compose abcde  7        0
3:          25      42             herd primary abcde  7        4
4:          25      42                    stall abcde  0        0
5:          25      42                     deep abcde  0        4
6:          25      42 regular summary classify abcde  7, 7, 8  0
7:          25      42                   timber abcde  6        0
8:          25      42                 property abcde  8        0

For the case whether there are several longer/shorter, adding commas to the data.table could be cumbersome. This format would be easier to work with, so the end result I want is as follows:

   numericvals numeric                   first second  longer  shorter
1:          25      42        beneficiary, duke abcde  11      4
2:          25      42                  compose abcde  7       0
3:          25      42             herd primary abcde  7       4
4:          25      42                    stall abcde  0       0
5:          25      42                     deep abcde  0       4
6:          25      42 regular summary classify abcde  7       0
6:          25      42 regular summary classify abcde  7       0
6:          25      42 regular summary classify abcde  8       0
7:          25      42                   timber abcde  6       0
8:          25      42                 property abcde  8       0

How does one compare multiple strings in a data.table, creating new rows for multiple entries?

(I'm using R data.table, but I'm happy to use data.frame as well.)

EDIT: Based on comments below, I realize the second table is wrong. Or at least, values should only be counted once.

1
what if there is a deep regular summary classify? do you want 1 to be repeated in the shorter column for all 3 rows?chinsoon12
@chinsoon12 Good point. No, it should be counted only once. I think the second table is wrong---users would parse the first table to get all values.ShanZhengYang

1 Answers

2
votes

Using base functions but wrap inside data.table

For the first output in OP:

dt[, do.call(rbind, mapply(function(x, snd) {
        lens <- nchar(x[x!=""])
        longer <- lens[lens > snd]
        if (length(longer) == 0L) longer <- 0L
        shorter <- lens[lens < snd]
        if (length(shorter) == 0L) shorter <- 0L

        list(list(longer), list(shorter))            
    }, strsplit(first, ",| "), nchar(second), SIMPLIFY=FALSE)), by=names(dt)]

For the 2nd output in OP,

dt[, do.call(rbind, mapply(function(x, snd) {
    lens <- nchar(x[x!=""])
    longer <- lens[lens > snd]
    if (length(longer) == 0L) longer <- 0L
    shorter <- lens[lens < snd]
    if (length(shorter) == 0L) shorter <- 0L

    #pad to equal length
    if (length(longer) > length(shorter)) {
        shorter <- c(shorter, rep(0L, length(longer) - length(shorter)))
    } 
    if (length(longer) < length(shorter)) {
        longer <- c(longer, rep(0L, length(shorter) - length(longer)))
    }

    #second kind of output
    data.frame(longer, shorter)
}, strsplit(first, ",| "), nchar(second), SIMPLIFY=FALSE)), by=names(dt)]

Explanation: First split each string into words using strsplit(first, ",| "), then apply the OP requirements in checking for word length greater than or smaller than word length of the reference column. Then, row bind results into a data.frame and return the results.