3
votes

Maybe I am just stupid but even then I would appreciate some helpful comments.

I do have a dataset containing IDs, articles and article attributes (i.e. groupings). Using RODBC, I fetch the table from a SQL Server database (with sqlFetch). Therefore, I couldn't find a way to use read.transactions from the arules-package. The format is constructed that each row is a ID to article+grouping combination. The table has 22 million rows with roughly 11 million unique IDs. The article base is around 500,000 articles. For illustration purposes, I created an example data frame like this

dat <- data.frame(id <- rep(paste0("id", 1:5),1, each=3),
                  article <- rep(paste0("A", sample(1051:1069,15,T))), 
                  groupA <- paste0(sample(LETTERS[1:3],15,TRUE), sample(1:3, 15,T))
)
colnames(dat) <- c("id", "article", "groupA")
dat$groupB[dat$groupA == "A1"|dat$groupA == "A2"|dat$groupA =="A3"] <- 1
dat$groupB[dat$groupA == "B1"|dat$groupA == "B2"|dat$groupA =="B3"] <- 2
dat$groupB[dat$groupA == "C1"|dat$groupA == "C2"|dat$groupA =="C3"] <- 3

resulting in a data.frame similar to this

    id article groupA groupB
1  id1   A1053     B3      2
2  id1   A1051     B1      2
3  id1   A1067     B2      2
4  id2   A1054     B1      2
5  id2   A1069     B1      2
6  id2   A1068     A3      1
7  id3   A1053     C3      3
8  id3   A1066     B3      2
9  id3   A1057     A2      1
10 id4   A1056     B1      2
11 id4   A1065     C1      3
12 id4   A1069     A3      1
13 id5   A1066     A3      1
14 id5   A1062     A3      1
15 id5   A1059     B2      2

I want to use the arules package for a market basket analysis. However, if I transfer the data.frame to a transactions object with txn <- as(dat, "transactions") every row is identified as one basket. Is there a possibility to point arules to the id variable as identifier? Basically, I am looking for the "single" option from the read.transactions function in the as -transformation. A second, related question would be: Can I use attributes with the arules package? I thought it would be need to perform an analysis with the apriori algorithm not only on the article level but also on the grouping levels. E.g. if I buy article A1053, I find a rule to recommend groupA B3. Is this even possible?

1
Same, exact question here...eflores89
@Dinesh Sure, this works, but it consumes a lot of memory if I work an Millions of rows, and actually takes a lot of time. So, while nice in theory and on small data, it is hard to implement in a productive environment.Daniel Schultz
In may experience it is totally worth the time to first dump the data into an optimal representation for analysis instead of trying to run complex algorithms via ODBC, SQL etc. because they add a very thick layer in the very wrong place. You may want to use some of the more advanced methods though, depending on your data distribution.Has QUIT--Anony-Mousse

1 Answers

2
votes

Short answer: split. I've looked into the read.transactions code (which is in R, so easily accessible for those familiar with R). It does some nice checks, but the essential lines of code seem to be:

entries <- split(entries[[cols[2]]], entries[[cols[1]]])
as(entries, "transactions")

In your example:

dat <- split(dat$article, dat$id)
ts <- as(dat, "transactions")

If you prefer, you can make a variant on the read.transactions function that reads from an RODBC connection instead of a flat file.