I have the following R data.table dt
, which is composed of several numeric columns with two columns of character strings.
dt = data.table(
numericvals = rep(25, 8),
numeric = rep(42, 8),
first = c("beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property"),
second = rep(c("abcde"), 8)
)
print(dt)
numericvals numeric first second
1: 25 42 beneficiary, duke abcde
2: 25 42 compose abcde
3: 25 42 herd primary abcde
4: 25 42 stall abcde
5: 25 42 deep abcde
6: 25 42 regular summary classify abcde
7: 25 42 timber abcde
8: 25 42 property abcde
The column first
contains one or more strings. If there are more that one, these are separated by either a space or comma.
My goal is to create a column recording the length of strings in first
which are longer or shorter in length (via nchar()
) than the strings in second
. If these are the same size, this case should be ignored.
If the columns were composed of only one string per row, this analysis would be easy for me. I would create a new column called longer
and keep track of the string length in first
if it is longer, i.e.
dt[, longer:=ifelse(nchar(first) > nchar(second), nchar(first), 0)]
and similar for shorter:
dt[, shorter:=ifelse(nchar(first) < nchar(second), nchar(first), 0)]
I don't know how to deal with multiple strings in first
, especially if there are 3.
Here is how the analysis should look like:
numericvals numeric first second longer shorter
1: 25 42 beneficiary, duke abcde 11 4
2: 25 42 compose abcde 7 0
3: 25 42 herd primary abcde 7 4
4: 25 42 stall abcde 0 0
5: 25 42 deep abcde 0 4
6: 25 42 regular summary classify abcde 7, 7, 8 0
7: 25 42 timber abcde 6 0
8: 25 42 property abcde 8 0
For the case whether there are several longer/shorter, adding commas to the data.table could be cumbersome. This format would be easier to work with, so the end result I want is as follows:
numericvals numeric first second longer shorter
1: 25 42 beneficiary, duke abcde 11 4
2: 25 42 compose abcde 7 0
3: 25 42 herd primary abcde 7 4
4: 25 42 stall abcde 0 0
5: 25 42 deep abcde 0 4
6: 25 42 regular summary classify abcde 7 0
6: 25 42 regular summary classify abcde 7 0
6: 25 42 regular summary classify abcde 8 0
7: 25 42 timber abcde 6 0
8: 25 42 property abcde 8 0
How does one compare multiple strings in a data.table, creating new rows for multiple entries?
(I'm using R data.table, but I'm happy to use data.frame as well.)
EDIT: Based on comments below, I realize the second table is wrong. Or at least, values should only be counted once.
deep regular summary classify
? do you want 1 to be repeated in theshorter
column for all 3 rows? – chinsoon12