1
votes

My aim is to produce significant rules on diagnostic data.

I preprocessed my dataset into ARFF (non-sparse) format; it has 116,000 instances and 28 attributes. I am running the Apriori algorithm in Weka (through the Explorer interface) with these settings:

Apriori -N 20 -T 1 -C 0.8 -D 0.05 -U 1.0 -M 0.1 -M 0.1 -S -1.0 -c -1

The algorithm seems to take a long time to finish. I have now been waiting more than three hours for it to complete.

Is this normal? Is there a way to speed up the algorithm (by preprocessing the data differently, or by choosing other parameters for Apriori)? Or is Weka not the right tool for a dataset of this size?

2
Unfortunately, Weka is often very slow. Try A) using ELKI, SPMF, or the C implementations by Christian Borgelt instead, which should be much faster, or B) the FPGrowth algorithm instead of Apriori. - Has QUIT--Anony-Mousse

2 Answers

0
votes

You could subsample or partition your dataset and run the Apriori algorithm on every partition or some of them, and then combine the obtained rules.
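As a rough illustration of the partition idea, here is a minimal Python sketch. It is not Weka code, and it combines frequent itemsets rather than finished rules, which is one possible way to do the "combine" step: mine each partition separately, pool the locally frequent itemsets as candidates, then recount the candidates on the full data so that only globally frequent ones survive. The function names, the brute-force miner, and the `max_size` cap are all assumptions made for the example; in practice you would swap in a real miner for the per-partition step.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force miner for small inputs: counts every itemset up to max_size.

    Hypothetical helper for illustration only; a real miner would replace it.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    threshold = min_support * len(transactions)
    return {itemset for itemset, c in counts.items() if c >= threshold}


def partitioned_frequent_itemsets(transactions, min_support, n_parts, max_size=3):
    """Mine each partition locally, then verify the candidates on the full data."""
    if not transactions:
        return {}
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Pass 1: an itemset that is frequent overall must be frequent in at
    # least one partition (same relative threshold), so the union of the
    # local results contains every globally frequent itemset.
    candidates = set()
    for part in parts:
        candidates |= frequent_itemsets(part, min_support, max_size)

    # Pass 2: recount only the candidates on the whole dataset.
    threshold = min_support * len(transactions)
    result = {}
    for cand in candidates:
        count = sum(1 for t in transactions if set(cand) <= set(t))
        if count >= threshold:
            result[cand] = count / len(transactions)
    return result


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"},
            {"b", "c"}, {"a", "b"}, {"a"}]
    for itemset, support in sorted(partitioned_frequent_itemsets(data, 0.5, 2).items()):
        print(itemset, round(support, 2))
```

The second pass matters: an itemset can look frequent inside one partition and still be rare overall, so rules built from per-partition results alone may not hold on the full dataset.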

Some algorithms can take a long time to finish; the running time depends on several factors (number of instances, number of attributes, type of attributes) and on the algorithm's own time and space complexity. Weka is not particularly fast, partly because it is written in Java, which is often slower than languages compiled to native code.

Sometimes it is faster to run an algorithm several times on much smaller partitions of your dataset, because of the computational complexity mentioned above.

For example, if your algorithm takes time proportional to the square of the number of instances, c*N^2, then running it 10 times on partitions that are each 10 times smaller costs 10*c*(N/10)^2 = 10*c*N^2/10^2 = 0.1*c*N^2, which is one tenth of the original time.
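The arithmetic above can be checked with a few lines of Python. This is only a sketch of the cost model, assuming the running time is exactly c*N^p for some exponent p; the function name and the `exponent` parameter are made up for the example, and Apriori's real cost depends heavily on the support threshold and the data, not just on N.

```python
def relative_cost(n_instances, n_parts, exponent=2):
    """Cost of n_parts runs on n_instances / n_parts rows each,
    relative to one run on all rows, assuming cost = c * N**exponent.
    The constant c cancels out in the ratio."""
    full = n_instances ** exponent
    split = n_parts * (n_instances / n_parts) ** exponent
    return split / full


if __name__ == "__main__":
    # Quadratic cost, 10 partitions of the 116000-instance dataset.
    print(relative_cost(116000, 10))
    # Linear cost: splitting gives no saving under this model.
    print(relative_cost(116000, 10, exponent=1))
```

Under this model the saving only appears when the cost grows faster than linearly in N; with a linear cost, the partitions add up to the same total work.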

Hope this helps.

0
votes

Weka, like many other data mining libraries, offers only the two most famous algorithms: Apriori and FPGrowth. Apriori is an old algorithm that is well known for being inefficient. Moreover, the Weka implementations of both Apriori and FPGrowth are slow.

If you want better Java implementations and more algorithms, you can check the SPMF open-source data mining library (I'm the founder), which offers the largest collection of pattern mining algorithms (more than 110 algorithms). For itemset mining, it offers Apriori and FPGrowth but also many other algorithms, such as Eclat (2000), HMine (2005), and LCM (the fastest at the FIMI 2004 competition), and some newer ones, such as FIN (2014), PrePost (2014) and PrePost+ (2015), which can be faster than the earlier algorithms. It also offers many variations of these algorithms, for example for mining rare itemsets, correlated itemsets, high utility itemsets, itemsets in uncertain data, association rules, closed patterns, sequential patterns, sequential rules, etc.

There are some performance evaluations on the website showing that the SPMF implementations are much faster than Weka's for Apriori/FPGrowth.