2
votes

I have a column 'product_list' in a dataframe which looks like this: ";165533;3;1050.00;,;165535;1;700.00;

This is a list of products bought within a purchase id. In the above example 165533 is SKU, 3 is the quantity of products purchased, 1050.00 is the amount of purchase and so on. This field can contain multiple product SKU's. Multiple SKU's are separated by comma. I want to extract only SKU's from this string in R using regex which I am new to.

str = c(";165533;3;1050.00;,;165535;1;700.00;")

I was able to split the SKU's by using below:

strsplit(Type, ",;"). 

My question is how do I extract only the first value from the comma separated values.

I want the final output to look like this:

Purchase ID  SKU
123          165533
123          165535

Is there a better way to extract this data?

Here is the dput output:

dput(Purchase_test[, c(1, 2)]) structure(list(post_purchaseid = c(123L, 456L, 321L, 888L, 345L, 938L, 647L, 657L, 687L, 547L, 647L, 711L, 811L, 911L, 1001L), post_product_list = structure(c(6L, 4L, 11L, 9L, 2L, 5L, 7L, 1L, 3L, 4L, 10L, 8L, 4L, 12L, 13L), .Label = c(";153147;1;100.00;,;165533;1;350.00;,;165537;1;3800.00;", ";153147;1;100.00;,;165533;3;1050.00;,;165531;1;200.00;,;165535;1;700.00;", ";153147;1;100.00;,;165533;3;1050.00;,;165536;1;2750.00;", ";153147;1;100.00;,;165535;1;700.00;", ";153147;1;100.00;,;165535;2;1400.00;", ";153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;", ";153147;1;15.34;,;165533;1;53.70;", ";153147;1;31.51;,;153418;2;200.00;", ";153147;1;43.84;,;165531;1;87.67;", ";153147;1;49.86;,;165533;1;174.52;", ";165533;3;1050.00;,;165535;1;700.00;", ";creating your first text;1;4200.00;207=4200.00;,;Get started with;1;3900.00;207=3900.00;", ";Get started with;1;3900.00;207=3900.00;"), class = "factor")), class = "data.frame", row.names = c(NA, -15L))

3
Can you use dput to show the exampleakrun
Where does 123 come from?Cary Swoveland
@akrun dput attached in original questionMP12
@MP12 just a doubt, in the first row, it is ";153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;", so do you expect the output to have 153147, 165531, 153418 as new rowsakrun
@r2evans For each purchase id, if it contains multiple SKU's, I want the output to contain separate rows for each SKU along with the Purchase id.MP12

3 Answers

1
votes

We can extract with str_extract_all and unnest the list column to expand the rows

library(dplyr)
library(tidyr)
library(stringr)
out <- Purchase_test %>% 
     mutate(SKU = str_extract_all(post_product_list, "(?<=(^|,);)[^;]+")) %>% 
     unnest(c(SKU))
out
# A tibble: 34 x 3
#   post_purchaseid post_product_list                                                        SKU   
#             <int> <fct>                                                                    <chr> 
# 1             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      153147
# 2             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      165531
# 3             123 ;153147;1;12.05;,;165531;1;24.11;,;153418;5;500.00;                      153418
# 4             456 ;153147;1;100.00;,;165535;1;700.00;                                      153147
# 5             456 ;153147;1;100.00;,;165535;1;700.00;                                      165535
# 6             321 ;165533;3;1050.00;,;165535;1;700.00;                                     165533
# 7             321 ;165533;3;1050.00;,;165535;1;700.00;                                     165535
# 8             888 ;153147;1;43.84;,;165531;1;87.67;                                        153147
# 9             888 ;153147;1;43.84;,;165531;1;87.67;                                        165531
#10             345 ;153147;1;100.00;,;165533;3;1050.00;,;165531;1;200.00;,;165535;1;700.00; 153147
# … with 24 more rows

out$SKU
#[1] "153147"                   "165531"                   "153418"                   "153147"                  
#[5] "165535"                   "165533"                   "165535"                   "153147"                  
#[9] "165531"                   "153147"                   "165533"                   "165531"                  
#[13] "165535"                   "153147"                   "165535"                   "153147"                  
#[17] "165533"                   "153147"                   "165533"                   "165537"                  
#[21] "153147"                   "165533"                   "165536"                   "153147"                  
#[25] "165535"                   "153147"                   "165533"                   "153147"                  
#[29] "153418"                   "153147"                   "165535"                   "creating your first text"
#[33] "Get started with"         "Get started with"  
0
votes

Your data is a little inconsistent with your expected output (where is "purchase id"?), but I don't think you need a regex here.

read.csv2(text=gsub(",", "\n", ";165533;3;1050.00;,;165535;1;700.00;"), header=FALSE)
#   V1     V2 V3      V4 V5
# 1 NA 165533  3 1050.00 NA
# 2 NA 165535  1  700.00 NA

and from there you can discard columns, name them, etc.

0
votes

You can extract the matches with the regex

(?<=;)\d+(?=;\d+;\d+\.\d+)

Demo

The regex performs the following operations:

(?<=;)  # match ';' in a positive lookbehind
\d+     # match 1+ digits (SKU)
(?=     # begin a positive lookahead
  ;\d+  # match ';' then 1+ digits
  ;\d+  # match ';' then 1+ digits
  \.\d+ # match '.' then 1+ digits
)       # end positive lookahead

The positive lookbehind and lookahead are zero-width matches, meaning they are not part of the full matches; they merely enforce requirements.

If the SKU must be six digits you could replace the first \d+ with \d{6}.