6
votes

R Basket analysis using arules package with unique order number but duplicate order combinations

Just learning R. I'm trying to do a basket analysis using the arules package (but I'm totally open to any other package suggestions!) to compare all possible combinations of 6 different item types being purchased.

My original data set looked like this:

OrderNo, ItemType, ItemCount  
111, Health, 1  
111, Leisure, 2  
111, Sports, 1  
222, Health, 3      
333, Food, 7  
333, Clothing, 1  
444, Clothing, 2  
444, Health, 1  
444, Accessories, 2  

. . .

The list goes on and has about 3,000 observations.

I collapsed the data into a matrix that contains one row for each unique order containing counts of specific ItemType:

 OrderNo, Accessories, Clothing, Food, Health, Leisure, Sports  
 111, 0, 0, 0, 1, 2, 1  
 222, 0, 0, 0, 3, 0, 0  
 333, 0, 1, 7, 0, 0, 0  
 444, 2, 2, 0, 1, 0, 0  
 . . .
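
For reference, here is a minimal sketch of that collapse in base R. It assumes the original data sits in a data frame with the three columns shown above; the names `orders` and `wide` are made up for illustration, and only the sample rows from this post are used.

```r
# Long-format sample built from the rows shown in the question
# ('orders' is a hypothetical name for the original data set)
orders <- data.frame(
  OrderNo   = c(111, 111, 111, 222, 333, 333, 444, 444, 444),
  ItemType  = c("Health", "Leisure", "Sports", "Health", "Food",
                "Clothing", "Clothing", "Health", "Accessories"),
  ItemCount = c(1, 2, 1, 3, 7, 1, 2, 1, 2)
)

# Sum ItemCount for every OrderNo x ItemType pair;
# pairs that never occur are filled with 0
wide <- xtabs(ItemCount ~ OrderNo + ItemType, data = orders)
print(wide)
```

Each row of `wide` is one order and each column one ItemType, which matches the layout above.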

Every time I try to read in the transactions using the following command (and a million attempted variations of it):

tr <- read.transactions("dataset.csv", rm.duplicates=FALSE, format="basket", sep=",")

I get the error message: Error in asMethod(object): can not coerce list with transactions with duplicated items.

I'm assuming this is because I have 3,000 observations and inevitably certain combinations are going to show up more than once (e.g., more than one person is purchasing only one piece of Clothing and nothing else: OrderNo, 0, 1, 0, 0, 0, 0). I know I could collapse the data set on counts of unique combinations, but I'm worried that if I do that, there will be no weights to show the most frequent combinations.

I thought that using format="basket" would account for different orders containing the same item combinations, but apparently that's not the case. I'm so lost. All the documentation I've read implies that this is possible but I can't find any examples or advice on how to approach the problem.

Any advice would be so appreciated! My head is spinning on this one.

Extra info: For my end result, I'm looking to get the five most significant purchase combinations. I don't know if that helps.

2
Care to provide a small, self-contained example? stackoverflow.com/questions/5963269/… (comment by Roman Luštrik)

2 Answers

1
votes

You must remove the duplicates. If you are using a .csv file, run Data -> Remove Duplicates in Excel before processing it. arules throws this error when it finds duplicated items within a transaction, and that is why you are getting it.

Another way is to find the duplicates in your item set with duplicated() and remove them with unique().
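
A minimal sketch of that second route, assuming the data is in long format with one row per order/item pair. The data frame `orders` and its contents are invented for illustration and are not the asker's file.

```r
# Hypothetical long-format data with one repeated (OrderNo, ItemType) pair
orders <- data.frame(
  OrderNo  = c(111, 111, 111, 222),
  ItemType = c("Health", "Health", "Leisure", "Health")
)

# duplicated() flags every row that repeats an earlier (OrderNo, ItemType) pair
dup <- duplicated(orders[, c("OrderNo", "ItemType")])
print(orders[dup, ])

# Keep only the first occurrence of each pair
deduped <- orders[!dup, ]

# unique() drops repeated rows in one step when the data frame
# holds only these two columns
deduped_alt <- unique(orders)
print(deduped)
```

Note that the same item appearing in different orders is not a duplicate here; only a repeat within one order is removed.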

Or a simpler approach can be found in this SO post:

Association analysis with duplicate transactions using arules package in R

5
votes

Ok, after hours of searching and reading all the pdfs I could find, I finally found the answer (and most helpful walkthrough of apriori/basket analysis ever!) in the DATA MINING Desktop Survival Guide by Graham Williams:

The read.transactions function can also read data from a file with transaction ID and a single item per line (using the format="single" option).

So there was no need to do all those transformations. I should have imported straight from the original CSV file, specifying the "single" format option instead of "basket". I also had to make sure the file contained no column names and that each pairing of item type and order number appeared only once (for instance, if a person ordered two items from the "Grocery" category, this needs to be represented on one row). The cols option names the column with the transaction ID first and the column with the item second, so with the order number in column 1 and ItemType in column 2 it is cols=c(1,2).

tr <- read.transactions(file='dataset.csv', format='single', sep=',', cols=c(1,2))
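
To make this reproducible, here is a small sketch that writes the sample rows from the question to a temporary file and reads them back in "single" format. It assumes the arules package is installed. The arules documentation describes cols as the transaction ID column followed by the item column, so with OrderNo first this sketch uses cols = c(1, 2). The support threshold of 0.25 and the names `path` and `sets` are arbitrary choices for this tiny sample, not values from my real data.

```r
library(arules)

# Sample rows from the question, written without a header line
rows <- c("111,Health,1", "111,Leisure,2", "111,Sports,1",
          "222,Health,3",
          "333,Food,7", "333,Clothing,1",
          "444,Clothing,2", "444,Health,1", "444,Accessories,2")
path <- tempfile(fileext = ".csv")
writeLines(rows, path)

# cols = c(transaction id column, item column);
# the third column (ItemCount) is not used
tr <- read.transactions(path, format = "single", sep = ",", cols = c(1, 2))
summary(tr)

# Frequent combinations of two or more items, ranked by support.
# support = 0.25 is an arbitrary threshold for this 4-order sample.
sets <- apriori(tr, parameter = list(target = "frequent itemsets",
                                     support = 0.25, minlen = 2))
inspect(head(sort(sets, by = "support"), 5))
```

Orders that contain the same combination of items are simply counted as separate transactions, so frequent combinations end up with higher support and nothing has to be collapsed by hand.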