Extract values from a string in R

Question

I have a column 'product_list' in a dataframe which looks like this: ";165533;3;1050.00;,;165535;1;700.00;

This is a list of products bought within a purchase id. In the above example 165533 is SKU, 3 is the quantity of products purchased, 1050.00 is the amount of purchase and so on. This field can contain multiple product SKU's. Multiple SKU's are separated by comma. I want to extract only SKU's from this string in R using regex which I am new to.

str = c(";165533;3;1050.00;,;165535;1;700.00;")

I was able to split the SKU's by using below:

strsplit(Type, ",;").

My question is how do I extract only the first value from the comma separated values.

I want the final output to look like this:

Purchase ID  SKU
123          165533
123          165535

Is there a better way to extract this data?

Here is the dput output:

dput(Purchase_test[, c(1, 2)]) structure(list(post_purchaseid = c(123L, 456L, 321L, 888L, 345L, 938L, 647L, 657L, 687L, 547L, 647L, 711L, 811L, 911L, 1001L), post_product_list = structure(c(6L, 4L, 11L, 9L, 2L, 5L, 7L, 1L, 3L, 4L, 10L, 8L, 4L, 12L, 13L), .Label = c(";153147;1;100.00;,;165533;1;350.00;,;165537;1;3800.00;", ";153147;1;100.00;,;165533;3;1050.00;,;165531;1;200.00;,;165535;1;700.00;", ";153147;1;100.00;,;165533;3;1050.00;,;165536;1;2750.00;", ";153147;1;100.00;,;165535;1;700.00;", ";153147;1;100.00;,;165535;2;1400.00;", ";153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;", ";153147;1;15.34;,;165533;1;53.70;", ";153147;1;31.51;,;153418;2;200.00;", ";153147;1;43.84;,;165531;1;87.67;", ";153147;1;49.86;,;165533;1;174.52;", ";165533;3;1050.00;,;165535;1;700.00;", ";creating your first text;1;4200.00;207=4200.00;,;Get started with;1;3900.00;207=3900.00;", ";Get started with;1;3900.00;207=3900.00;"), class = "factor")), class = "data.frame", row.names = c(NA, -15L))

@MP12 just a doubt, in the first row, it is ";153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;", so do you expect the output to have 153147, 165531, 153418 as new rows — akrun
@r2evans For each purchase id, if it contains multiple SKU's, I want the output to contain separate rows for each SKU along with the Purchase id. — MP12

akrun akrun · Accepted Answer · 2020-03-18T18:54:53

We can extract with str_extract_all and unnest the list column to expand the rows

library(dplyr)
library(tidyr)
library(stringr)
out <- Purchase_test %>% 
     mutate(SKU = str_extract_all(post_product_list, "(?<=(^|,);)[^;]+")) %>% 
     unnest(c(SKU))
out
# A tibble: 34 x 3
#   post_purchaseid post_product_list                                                        SKU   
#             <int> <fct>                                                                    <chr> 
# 1             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      153147
# 2             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      165531
# 3             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      153418
# 4             456 ;153147;1;100.00;,;165535;1;700.00;                                      153147
# 5             456 ;153147;1;100.00;,;165535;1;700.00;                                      165535
# 6             321 ;165533;3;1050.00;,;165535;1;700.00;                                     165533
# 7             321 ;165533;3;1050.00;,;165535;1;700.00;                                     165535
# 8             888 ;153147;1;43.84;,;165531;1;87.67;                                        153147
# 9             888 ;153147;1;43.84;,;165531;1;87.67;                                        165531
#10             345 ;153147;1;100.00;,;165533;3;1050.00;,;165531;1;200.00;,;165535;1;700.00; 153147
# … with 24 more rows

out$SKU
#[1] "153147"                   "165531"                   "153418"                   "153147"                  
#[5] "165535"                   "165533"                   "165535"                   "153147"                  
#[9] "165531"                   "153147"                   "165533"                   "165531"                  
#[13] "165535"                   "153147"                   "165535"                   "153147"                  
#[17] "165533"                   "153147"                   "165533"                   "165537"                  
#[21] "153147"                   "165533"                   "165536"                   "153147"                  
#[25] "165535"                   "153147"                   "165533"                   "153147"                  
#[29] "153418"                   "153147"                   "165535"                   "creating your first text"
#[33] "Get started with"         "Get started with"

Extract values from a string in R

3 Answers