Please help to interpret the results of SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade)
With support = 0.05:
s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))
I get, for example, these sequences:
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
Looks like these are the same sequences, aren't they? How <{C},{V}> semantically differes from <{C,V}> ? Any real life examples?
From Spade paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31--60):
"An input-sequence C is said to contain another sequence A, if A is a subsequence of the input-sequence C. The support or frequency of a sequence is the the total number of input-sequences in the database D that contain A."
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in database D, correct?
Complete output that I get from my data:
> as(s1, "data.frame")
sequence support
1 <{C}> 1.00000000
2 <{L}> 0.20468120
3 <{V}> 0.73127376
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
6 <{L,V}> 0.07882027
7 <{V},{V}> 0.13343431
8 <{C,V},{V}> 0.13343431
9 <{C},{C},{V}> 0.05558572
10 <{C,L,V}> 0.07882027
11 <{V},{C,V}> 0.13343431
12 <{C},{C,V}> 0.15644023
13 <{C,V},{C,V}> 0.13343431
14 <{C},{C},{C,V}> 0.05558572
15 <{C},{L}> 0.05738619
16 <{C,L}> 0.20468120
17 <{C},{C,L}> 0.05738619
18 <{C},{C}> 0.22128547
19 <{L},{C}> 0.06233031
20 <{V},{C}> 0.16921494
21 <{V},{V},{C}> 0.05047012
22 <{V},{C},{C}> 0.06233031
23 <{C,V},{C}> 0.16921494
24 <{C},{V},{C}> 0.05781487
25 <{C,V},{V},{C}> 0.05047012
26 <{V},{C,V},{C}> 0.05047012
27 <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29 <{C,L},{C}> 0.06233031
30 <{C},{C},{C}> 0.07882027
31 <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with
most frequent items:
C V L (Other)
27 22 8 8
most frequent elements:
{C} {V} {C,V} {L} {C,L} (Other)
21 12 12 3 3 2
element (sequence) size distribution:
sizes
1 2 3
7 13 11
sequence length distribution:
lengths
1 2 3 4 5
3 9 12 6 1
summary of quality measures:
support
Min. :0.05047
1st Qu.:0.05760
Median :0.07882
Mean :0.17121
3rd Qu.:0.16283
Max. :1.00000
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
x 61000 34991 0.05
>