Dataset limit dimension

Question

My aim is to produce significant rules on diagnostic data.

I preprocessed my dataset into non-sparse ARFF format; it has 116,000 instances and 28 attributes. I run the Apriori algorithm in Weka like this (using the Weka Explorer interface):

Apriori -N 20 -T 1 -C 0.8 -D 0.05 -U 1.0 -M 0.1 -M 0.1 -S -1.0 -c -1

The algorithm seems to take a long time to finish. I have already been waiting more than three hours for it to complete.

Is this normal? Is there a way to speed up the algorithm (preprocessing the data in some other way, or choosing different parameters for Apriori)? Or is Weka not the right tool for a dataset of this size?

Unfortunately, Weka is often very slow. Try A) using ELKI, SPMF, or the C implementations by Christian Borgelt instead, which should be much faster, or B) using the FPGrowth algorithm instead of Apriori. — Has QUIT--Anony-Mousse

shirowww · Accepted Answer · 2015-09-18T21:25:18

You could subsample or partition your dataset, run the Apriori algorithm on each partition (or on some of them), and then combine the resulting rules.
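As a rough illustration of the partition-and-combine idea above, here is a minimal Python sketch. It is not Weka code: the function names, the toy `transactions` data, and the brute-force counting limited to itemsets of size 2 are all assumptions made for the example. It mines frequent itemsets in each partition, merges them as candidates, and then checks the candidates against the full dataset, which works because an itemset that is frequent overall must be frequent in at least one partition.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size (illustrative, not Apriori itself)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    threshold = min_support * len(transactions)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def partitioned_frequent_itemsets(transactions, min_support, n_parts, max_size=2):
    """Mine each partition separately, then verify merged candidates on the full data."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    candidates = set()
    for start in range(0, len(transactions), size):
        part = transactions[start:start + size]
        candidates |= frequent_itemsets(part, min_support, max_size)
    # Second pass: keep only candidates that are frequent in the whole dataset.
    threshold = min_support * len(transactions)
    result = {}
    for c in candidates:
        n = sum(1 for t in transactions if set(c) <= set(t))
        if n >= threshold:
            result[c] = n / len(transactions)
    return result


# Hypothetical toy data standing in for the diagnostic dataset.
transactions = [
    ["fever", "cough"],
    ["fever", "cough", "rash"],
    ["fever"],
    ["cough", "rash"],
    ["fever", "cough"],
    ["rash"],
]

for itemset, support in sorted(partitioned_frequent_itemsets(transactions, 0.5, 2).items()):
    print(itemset, round(support, 2))
```

Association rules can then be derived from the verified itemsets in the usual way. Whether this is actually faster than one full run depends on the miner and the data, so treat it as something to try rather than a guaranteed speed-up.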

Some algorithms can take a long time to finish. How long depends on several factors (number of instances, number of attributes, type of attributes) and on the algorithm's time and space complexity. Weka is not particularly fast, and it is written in Java, which is often slower than natively compiled languages.

Sometimes it is faster to run an algorithm several times on much smaller partitions of your dataset, because of the computational complexity mentioned above.

For example, if your algorithm takes time proportional to the square of the number of instances, c*N^2, then running it 10 times on partitions that are each 10 times smaller costs 10*c*(N/10)^2 = 0.1*c*N^2, which is one tenth of the original running time.
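The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the quadratic cost model is the answer's illustrative assumption, not a measured property of Apriori, and the function names are made up for the example. It also ignores the cost of combining the per-partition results.

```python
def full_cost(n, c=1.0):
    """Cost of one run on all n instances under the assumed c * n^2 model."""
    return c * n ** 2


def partitioned_cost(n, k, c=1.0):
    """Cost of k runs, each on n / k instances, under the same model."""
    return k * c * (n / k) ** 2


n = 116_000  # number of instances from the question
for k in (2, 10, 100):
    ratio = partitioned_cost(n, k) / full_cost(n)
    print(f"{k} partitions: {ratio:.3f} of the full-run cost")
```

Under this model the total cost shrinks by a factor equal to the number of partitions. With a different cost model the gain would be different.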

Hope this helps.
