9
votes

I want to use the arulesSequences package in R, but I don't know how to coerce my data frame into an object that the package can work with.

Here is a toy dataset that replicates my data structure:

ids <- c(rep("X", 5), rep("Y", 5), rep("Z", 5))
seq <- rep(1:5,3)
val <- sample(LETTERS, 15, replace = TRUE)
df <- data.frame(ids, seq, val)
df

   ids seq val
1    X   1   T
2    X   2   H
3    X   3   V
4    X   4   A
5    X   5   X
6    Y   1   D
7    Y   2   B
8    Y   3   A
9    Y   4   D
10   Y   5   P
11   Z   1   Q
12   Z   2   R
13   Z   3   W
14   Z   4   W
15   Z   5   P

Any help will be greatly appreciated.

6
To be clear: this data frame represents three sequences? X="THVAX"; Y="DBADP"; Z="QRWWP"? (Why is it stored that way?) - David Robinson
If I wanted to just use the arules package, I would only keep the ids and val columns. Each of the 3 transactions (X/Y/Z) would have 5 items. Because I want to do sequence mining (factor in the order of each item), I need to have a sequence/timing variable. I am struggling with how to generate transactions that retain this "timing" component. - Btibert3
Hi, did you find an answer to this problem? - Sir1

6 Answers

1
votes

Factor data frame:

df_fact = data.frame(lapply(df,as.factor))

Build "transaction" data:

df_trans = as(df_fact, 'transactions')

Test it:

itemFrequencyPlot(df_trans, support = 0.1, cex.names=0.8)
1
votes

By using read_baskets:

    read_baskets(con  = "filePath.txt",
      sep = " ",
      info = c("sequenceID","eventID","SIZE"))

In practice, this means exporting the prepared data to a text file and re-importing it through read_baskets. The info argument names the leading columns of each line: sequenceID, eventID, and an optional event-size column (SIZE).
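As a rough sketch of that round trip, the snippet below writes the question's toy data to a temporary text file, one event per line, and reads it back only if arulesSequences is installed. The fixed item values, the temporary file, and the choice to recode the ids as integers are assumptions made for illustration, not something the original answer specifies.

```r
# Toy data in the question's layout (item values fixed here instead of sampled).
df <- data.frame(
  ids = rep(c("X", "Y", "Z"), each = 5),
  seq = rep(1:5, times = 3),
  val = c("T", "H", "V", "A", "X",
          "D", "B", "A", "D", "P",
          "Q", "R", "W", "W", "P"),
  stringsAsFactors = FALSE
)

# One line per event: sequenceID eventID SIZE item
# Assumption: ids are recoded to integers (X = 1, Y = 2, Z = 3) as a precaution,
# and every event holds exactly one item, so SIZE is always 1.
lines <- paste(as.integer(factor(df$ids)), df$seq, 1, df$val)

basket_file <- tempfile(fileext = ".txt")  # hypothetical file location
writeLines(lines, basket_file)             # no header line is written

# Re-import the file only when the package is available.
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  baskets <- arulesSequences::read_baskets(
    con  = basket_file,
    sep  = " ",
    info = c("sequenceID", "eventID", "SIZE")
  )
  print(baskets)
}
```

Writing the lines with writeLines keeps the file free of quotes and column headers, which is the layout the answers in this thread report working with.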

1
votes

What worked for me was adding an "order" column that holds an order ranking rather than a time value. You just have to be very specific about the naming convention: name the "group" or "ordered basket #" variable sequenceID, and name the ranking or ordering variable eventID.
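For illustration, here is one way such a ranking column could be derived in base R when the data carries real timestamps. The events data frame and its column names (customer, when, item) are hypothetical; only sequenceID and eventID follow the naming convention described above.

```r
# Hypothetical event log with real dates instead of an order column.
events <- data.frame(
  customer = c("X", "X", "X", "Y", "Y"),
  when = as.Date(c("2020-03-01", "2020-01-15", "2020-02-10",
                   "2020-05-05", "2020-04-01")),
  item = c("C", "A", "B", "E", "D"),
  stringsAsFactors = FALSE
)

# Sort by group and time so that row order reflects event order.
events <- events[order(events$customer, events$when), ]

# sequenceID: an integer code per group; eventID: rank of the event within its group.
events$sequenceID <- as.integer(factor(events$customer))
events$eventID <- ave(seq_along(events$customer), events$customer, FUN = seq_along)

print(events[, c("sequenceID", "eventID", "item")])
```

The ranking keeps only the order of events within each group, so the gaps between the original dates are discarded.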

Another thing that helped me (and had me scratching my head for a long time) was that read_baskets() seemed to need the info argument spelled out explicitly:

read_baskets(con = "filePath.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))

Even though the help page makes the c() details look like an optional header, they are not optional. I needed to remove the header line from my file and specify the column names in the read_baskets() call instead, or I'd run into problems.

0
votes

Instead of using the data frame directly, what worked best for me was to split the data by individual and then convert the result to transactions.

 eh$cost<-split(eh$cost$val ,eh$cost$id)
 eh$cost1<- as(eh$cost,"transactions")
0
votes

You first have to turn your items into transactions, so just coerce the column of items:
trans = as(df[,'val'], "transactions")

Then you can add the sequence information to your transactions object:

trans@itemsetInfo$transactionID = NULL
trans@itemsetInfo$sequenceID = df$ids
trans@itemsetInfo$eventID = df$seq

0
votes
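library(dplyr)
# Group by sequence and event so each row becomes one basket (the question's columns are ids and seq)
df <- df %>% group_by(ids, seq) %>% summarise(size = n(), items = paste(val, collapse = " "), .groups = "drop") %>% arrange(ids, seq)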

Then write the result to a text file (this tutorial also suggests writing the wrangled data out and then reading it back with the read_baskets function):

df$items <- as.character(df$items)
write.table(df, file = "trans.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)

Read the file back and check it:

x <- read_baskets("trans.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))
as(x, "data.frame")