Let us assume we have 2 data-frames in R.
a = data.frame(col1 = round(runif(6,1,20)),col2 = c("a b c","b e z","a c q","a b","w","u o p l"), stringsAsFactors = F)
b = data.frame(col1 = 1:10, col2 = round(runif(10,1,10)), col3 = round(runif(10,10,20)), col4 = c(paste(letters[1:15], collapse=" "),paste(letters[10:25], collapse=" "),paste(letters[1:15], collapse=" "),paste(letters[1:19], collapse=" "),paste(letters[10:15], collapse=" "),paste(letters[1:15], collapse=" "),paste(letters[20:25], collapse=" "),paste(letters[1:15], collapse=" "),paste(letters[3:26], collapse=" "),paste(letters[1:2], collapse=" ")),stringsAsFactors = F)
The data sets are :
a
col1 col2
15 a b c
8 b e z
11 a c q
15 a b
5 w
12 u o p l
b
col1 col2 col3 col4
1 1 10 a b c d e f g h i j k l m n o
2 2 12 j k l m n o p q r s t u v w x y
3 4 12 a b c d e f g h i j k l m n o
4 4 16 a b c d e f g h i j k l m n o p q r s
5 2 13 j k l m n o
6 3 15 a b c d e f g h i j k l m n o
7 1 12 t u v w x y
8 2 18 a b c d e f g h i j k l m n o
9 4 16 c d e f g h i j k l m n o p q r s t u v w x y z
10 3 12 a b
I want to create a 3rd column col3 in data-frame "a" based on the following points:
- a$col3 would consist of lists of serial numbers of values from col1 of data-frame "b".
- Those values of b$col1 would be picked for which a$col1 falls in between the values of b$col2 & b$col3 for each row.
- Also, the letters in a$col2 should be present in the b$col4. (Ordering is not required. eg. "a s" is equivalent to "s a".)
Final required dataset.
a
col1 col2 col3
15 a b c 4 6 8
8 b e z
11 a c q 4 9
15 a b 4 6 8
5 w 2 7 9
12 u o p l 2 9
A word of caution - For-loops solutions won't work as the data frames I work with are huge. (Have millions of rows). Any other method will be very much appreciated.
Thanks in advance.