I am using Spark 1.5.0 (CDH 5.5.2). I am running the FPGrowth algorithm on my transaction data and I get different results each time. I checked the input data with the Linux diff command and found no differences between runs. Is there a random seed involved in the FPGrowth function in Scala? Why am I getting a different number of frequent itemsets each time? Is there a tie that is broken randomly?

Also, the support value I use is very low: 0.000459. When I increase it to 0.005 the problem goes away. Is there a minimum support threshold that needs to be used?
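For context on why the support value matters: as far as I understand, Spark's FPGrowth converts the relative minSupport into an absolute occurrence count before mining. A minimal sketch of that conversion, using a hypothetical transaction count of 100000 (the real count comes from transactions.count() below):

```scala
// Sketch: how a relative minSupport becomes an absolute count threshold.
// The transaction count here (100000) is a made-up example value.
val minSupport = 0.000459
val numTransactions = 100000L
val minCount = math.ceil(minSupport * numTransactions).toLong
println(minCount) // an itemset must occur in at least this many transactions: 46
```

With a very low minSupport this threshold is small, so many borderline itemsets sit right at the cutoff, which is why a low support surfaces problems that a higher support hides.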
Thanks for your help.
Here is the code that I used:
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

val conf = new SparkConf()
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
val sc = new SparkContext(conf)
val data = sc.textFile("path/test_rdd.txt")
val transactions = data.map(x => x.split('\t'))
val transactioncount = transactions.count()
println(transactioncount)
transactions.cache()
val fpg = new FPGrowth().setMinSupport(0.000459)
val model = fpg.run(transactions)
println(model.freqItemsets.collect().length)
transactioncount is the same on every run. However, when I print the length of the RDD returned by FPGrowth, I get a different number each time.
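One thing I think is worth ruling out: FPGrowth expects the items within each transaction to be unique (newer Spark versions reject duplicates with an exception), and duplicate items inside a transaction can destabilize the frequent-itemset counts. A quick check-and-fix sketch, assuming the transactions RDD and fpg from the code above and that it runs in the same Spark session:

```scala
// Count transactions that contain a repeated item. FPGrowth assumes items
// within a single transaction are unique.
val dupes = transactions.filter(t => t.distinct.length != t.length).count()
println(s"transactions with duplicate items: $dupes")

// If dupes > 0, deduplicate each transaction before mining:
val cleaned = transactions.map(_.distinct)
val cleanedModel = fpg.run(cleaned)
```

If the duplicate count is zero, this is not the cause and the instability would have to come from somewhere else.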