4
votes

Please help me interpret the results of the SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade)

With support = 0.05:

s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))

I get, for example, these sequences:

4          <{C},{V}> 0.15644023
5            <{C,V}> 0.73127376

These look like the same sequence, don't they? How does <{C},{V}> differ semantically from <{C,V}>? Are there any real-life examples?

From the SPADE paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31--60):

"An input-sequence C is said to contain another sequence A, if A is a subsequence of the input-sequence C. The support or frequency of a sequence is the total number of input-sequences in the database D that contain A."

Then, for example, if:

            sequence    support
1              <{C}> 1.00000000

Does this mean that the sequence <{C}> is contained in every input-sequence in database D?

Complete output that I get from my data:

> as(s1, "data.frame")
            sequence    support
1              <{C}> 1.00000000
2              <{L}> 0.20468120
3              <{V}> 0.73127376
4          <{C},{V}> 0.15644023
5            <{C,V}> 0.73127376
6            <{L,V}> 0.07882027
7          <{V},{V}> 0.13343431
8        <{C,V},{V}> 0.13343431
9      <{C},{C},{V}> 0.05558572
10         <{C,L,V}> 0.07882027
11       <{V},{C,V}> 0.13343431
12       <{C},{C,V}> 0.15644023
13     <{C,V},{C,V}> 0.13343431
14   <{C},{C},{C,V}> 0.05558572
15         <{C},{L}> 0.05738619
16           <{C,L}> 0.20468120
17       <{C},{C,L}> 0.05738619
18         <{C},{C}> 0.22128547
19         <{L},{C}> 0.06233031
20         <{V},{C}> 0.16921494
21     <{V},{V},{C}> 0.05047012
22     <{V},{C},{C}> 0.06233031
23       <{C,V},{C}> 0.16921494
24     <{C},{V},{C}> 0.05781487
25   <{C,V},{V},{C}> 0.05047012
26   <{V},{C,V},{C}> 0.05047012
27   <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29       <{C,L},{C}> 0.06233031
30     <{C},{C},{C}> 0.07882027
31   <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with

most frequent items:
      C       V       L (Other) 
     27      22       8       8 

most frequent elements:
    {C}     {V}   {C,V}     {L}   {C,L} (Other) 
     21      12      12       3       3       2 

element (sequence) size distribution:
sizes
 1  2  3 
 7 13 11 

sequence length distribution:
lengths
 1  2  3  4  5 
 3  9 12  6  1 

summary of quality measures:
    support       
 Min.   :0.05047  
 1st Qu.:0.05760  
 Median :0.07882  
 Mean   :0.17121  
 3rd Qu.:0.16283  
 Max.   :1.00000  

includes transaction ID lists: FALSE 

mining info:
 data ntransactions nsequences support
    x         61000      34991    0.05
2
Hello there! Although I can create a transaction matrix from my data, I haven't been able to run SPADE because of the error invalid 'eid'. Here's the thread: stackoverflow.com/questions/60034239/… Would you please help me out? - Devarshi Goswami

2 Answers

1
votes

When using the SPADE algorithm, remember that you are also dealing with temporal data (i.e. the order or time of occurrence of each item is known).

Looks like these are the same sequences, aren't they? How <{C},{V}> semantically differs from <{C,V}> ? Any real life examples?

In your example, <{C},{V}> means that item C occurred first and item V occurred at some later time; <{C,V}> means that items C and V occurred at the same time, i.e. in the same event.
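To make the difference concrete, here is a minimal base R sketch that checks whether a pattern is contained in an input-sequence and computes its relative support. The toy database and the helper names `contains` and `support` are made up for illustration; this is not how arulesSequences implements cspade, only the subsequence definition quoted from the paper.

```r
# Toy database (invented data): each input-sequence is an ordered list of
# elements, and each element is a set of items that occurred together.
db <- list(
  s1 = list(c("C", "V")),        # C and V in the same event
  s2 = list("C", "V"),           # C first, V in a later event
  s3 = list(c("C", "V"), "V"),   # C and V together, then V again
  s4 = list(c("C", "V"), "C")    # C and V together, then C again
)

# TRUE if `pattern` is a subsequence of `sequence`: every pattern element
# must be a subset of a distinct sequence element, in the same order.
contains <- function(sequence, pattern) {
  pos <- 1
  for (element in sequence) {
    if (pos <= length(pattern) && all(pattern[[pos]] %in% element)) {
      pos <- pos + 1
    }
  }
  pos > length(pattern)
}

# Relative support: share of input-sequences that contain the pattern.
support <- function(db, pattern) {
  mean(vapply(db, contains, logical(1), pattern = pattern))
}

support(db, list("C", "V"))      # <{C},{V}>: matched by s2 and s3
support(db, list(c("C", "V")))   # <{C,V}>:   matched by s1, s3 and s4
```

Note that s1 contains <{C,V}> but not <{C},{V}>, because V never occurs in an event after C, while s2 is the opposite case.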

Then, for example, if:

            sequence    support
1              <{C}> 1.00000000

Does it mean that sequence <{C}> is contained in all sequences in database D, correct?

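Yes. A sequence with a support value of 1 is contained in ALL input-sequences. In a market basket analysis example, that means every customer bought C at least once; it does not mean that C appears in every single transaction.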
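Keep in mind that the support reported by cspade is relative to the number of input-sequences (nsequences in the mining info), not the number of transactions (ntransactions). A small base R sketch of the difference, using a made-up event log whose column names are only illustrative:

```r
# Invented event log: one row per transaction (event), possibly several
# rows per sequence (e.g. per customer).
events <- data.frame(
  sequenceID = c(1, 1, 2, 2, 2, 3),
  eventID    = c(1, 2, 1, 2, 3, 1),
  items      = c("C", "V", "C,V", "L", "C", "C"),
  stringsAsFactors = FALSE
)

# Does each transaction contain item C?
has_C <- vapply(strsplit(events$items, ","),
                function(x) "C" %in% x, logical(1))

# Share of transactions that contain C (4 of 6 here)
per_transaction <- mean(has_C)

# Share of sequences with at least one transaction containing C (3 of 3);
# this is the quantity that corresponds to the support of <{C}>
per_sequence <- mean(tapply(has_C, events$sequenceID, any))

per_transaction
per_sequence
```

Here <{C}> has support 1 even though two of the six transactions do not contain C.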

Hope this helps.

1
votes

Looks like these are the same sequences, aren't they? How <{C},{V}> semantically differs from <{C,V}> ? Any real life examples?

As user2552108 pointed out, {C,V} implies that C and V occurred at the same time. In practice this can be used to encode multi-dimensional sequential data. For example, suppose that C stands for Canada and V for Vancouver. A sequence could then look something like:

[{C,V,M,peanut,butter,maple_syrup}, ... , {}]

In this case, the frequent itemsets are not limited to single-item sets such as {C}, {V}, {U}, {W}, or {X}; they can also contain more than one item (items that appeared together, at the same time).

For this reason, the elements of a sequence are defined as sets of items rather than single items.

Does it mean that sequence <{C}> is contained in all sequences in database D, correct?

That's correct!