
I have transaction data in edge-list form, and I need to create a transaction-based sparse matrix that can be used with the arules R package. Currently I use "spread" from the tidyr package to convert the edge list to a matrix with one row per "basket ID". Then, after converting it to logical (since I can't use quantity information with arules), I convert it to the "transactions" data type. Please see the example R code below.

This works for small sets of baskets/transactions, but with many more it causes memory problems because of the "spread" function. Is there a more memory/resource-efficient way to convert the original edge list to the transactions data type that arules uses? Thanks in advance for any suggestions!

## Load libraries

library(tidyr)
library(arules)

## Create an example of the transactions that I am analyzing

TransEdgeList = data.frame(BasketID=c(1,1,2,2,3,3,3), 
                               Item=c(10,11,10,12,10,11,13),
                               Qty=c(1,1,2,3,1,2,1))

#convert to something that arules can transform
BasketDataFrame = spread(TransEdgeList, Item, Qty)

#convert to logical 
BasketDataFrame[, 2:dim(BasketDataFrame)[2]]=  
  !is.na(BasketDataFrame[, 2:dim(BasketDataFrame)[2]])

#convert to a transaction sparse matrix that arules can use
BasketMatrix = as(BasketDataFrame[, 2:dim(BasketDataFrame)[2]], "transactions")

BasketMatrix
Is table any better? table(TransEdgeList[1:2]) or xtabs(~ BasketID + Item, data=TransEdgeList, sparse=TRUE) - user20650

1 Answer


I would manually build a sparse pattern matrix in triplet form (ngTMatrix) with items as rows and baskets as columns, convert it into a sparse ngCMatrix, and then convert that into a transactions object. This way a full (dense) matrix is never created, so you should be fine memory-wise.

 library(arules)
 library(Matrix)

 TransEdgeList <- data.frame(BasketID=c(1,1,2,2,3,3,3), 
   Item=c(10,11,10,12,10,11,13),
   Qty=c(1,1,2,3,1,2,1))

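 # Rows are items, columns are baskets; the i and j slots are 0-based, hence the -1L
 m <- new("ngTMatrix", 
   i = as.integer(TransEdgeList$Item)-1L, 
   j = as.integer(TransEdgeList$BasketID)-1L, 
   Dim = as.integer(c(max(TransEdgeList$Item), max(TransEdgeList$BasketID))))

 # Convert the triplet form to compressed column form, which arules can coerce
 m <- as(m, "ngCMatrix")

 tr <- as(m, "transactions")
 inspect(tr)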
 inspect(tr)

     items      itemsetID
 [1] {10,11}    1        
 [2] {10,12}    2        
 [3] {10,11,13} 3
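
One caveat with the code above: it uses the raw Item and BasketID values as matrix indices, so the matrix gets max(Item) rows even though only four distinct items occur, and it would not work for non-numeric IDs. A possible variation, offered as a sketch rather than a tested recipe for large data, is to index through factors and let sparseMatrix() from the Matrix package build the pattern matrix. The variable names items and baskets are just illustrative, and setting the labels explicitly afterwards is my own choice, not something the original answer does.

```r
library(arules)
library(Matrix)

TransEdgeList <- data.frame(BasketID = c(1, 1, 2, 2, 3, 3, 3),
                            Item     = c(10, 11, 10, 12, 10, 11, 13),
                            Qty      = c(1, 1, 2, 3, 1, 2, 1))

# Map the IDs to dense 1-based indices; this also works for character IDs
items   <- factor(TransEdgeList$Item)
baskets <- factor(TransEdgeList$BasketID)

# With no x argument, sparseMatrix() returns a pattern matrix (ngCMatrix);
# rows are items and columns are baskets, the layout arules expects
m <- sparseMatrix(i = as.integer(items),
                  j = as.integer(baskets),
                  dims = c(nlevels(items), nlevels(baskets)))

tr <- as(m, "transactions")

# Restore the original IDs as item labels and transaction IDs
itemLabels(tr) <- levels(items)
transactionInfo(tr) <- data.frame(transactionID = levels(baskets))

inspect(tr)
```

Quantities are still dropped here, as in the original approach, because a pattern matrix only records whether an item appears in a basket.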