0
votes

I am using Spark 1.5.0 (cdh5.5.2). I am running the FPGrowth algorithm on my transaction data and I get different results each time. I checked the input with the Linux diff command and found no differences between runs. Is there any random seed involved in FPGrowth in Scala? Why am I getting a different number of frequent itemsets each time? Is there any tie that is broken randomly? Also, I use a very low support value, 0.000459; when I increase it to 0.005, the problem goes away. Is there a minimum support threshold that needs to be used?
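For context, my understanding of the MLlib source is that the fractional minSupport is converted into an absolute count threshold before mining. A minimal sketch of that conversion (the helper name minCount is mine, not an MLlib API):

```scala
// MLlib's FPGrowth treats minSupport as a fraction of the input size:
// an itemset is kept iff its count >= ceil(minSupport * numTransactions).
// (Helper name is mine; the formula reflects my reading of the MLlib source.)
def minCount(minSupport: Double, numTransactions: Long): Long =
  math.ceil(minSupport * numTransactions).toLong

// With exact binary fractions the result is easy to check by hand:
// minCount(0.5, 1000L) == 500
```

With a very low minSupport the absolute threshold is tiny, so many itemsets sit right at the cutoff; that by itself should not cause nondeterminism, but it would explain why the discrepancy is only visible at 0.000459 and not at 0.005.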

Thanks for your help.

Here is the code that I used:

import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

val conf = new SparkConf()
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
val sc = new SparkContext(conf)

// One tab-separated transaction per line
val data = sc.textFile("path/test_rdd.txt")
val transactions = data.map(x => x.split('\t'))
transactions.cache()
val transactioncount = transactions.count()
print(transactioncount)
print("\n")

val fpg = new FPGrowth().setMinSupport(0.000459)
val model = fpg.run(transactions)
print("\n")
print(model.freqItemsets.collect().length)
print("\n")

transactioncount is the same on every run. However, when I print the length of the RDD output by FPGrowth, I get a different number each time.

1
We would love to help you, but if there isn't a reproducible example, then it will be impossible for us. - Alberto Bonsanto
FPGrowth should return exactly the same result every time. Fire up a debugger; there could be a bug, on your side or in MLlib. - Has QUIT--Anony-Mousse
It is an 80 MB file. I tried testing on only the first 1000 transactions and couldn't reproduce it, @Alberto Bonsanto. I will try to see if I can simulate the problem with a smaller number of records. I am currently debugging, but for the same input file and the same cutoffs I get different output. When the support is a bit higher there is no issue. - user1050325

1 Answer

0
votes

The problem was that Cloudera turns on the Kryo serializer by default, whereas standalone Spark downloads default to the Java serializer. When I ran FPGrowth with the Kryo serializer, it asked me to register the Kryo classes; once I did, no errors popped up, but the results were incorrect. After switching back to the Java serializer, the results were correct and matched those from Spark 1.6.0. I still do not know whether the problem is in FPGrowth itself or whether Kryo serialization affects other functions/libraries as well.
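To make the run independent of cluster defaults, the serializer can be pinned explicitly in the SparkConf. A sketch of that, assuming the standard spark.serializer configuration key (the app name is a placeholder of mine):

```scala
import org.apache.spark.SparkConf

// Force the Java serializer even if the distribution's spark-defaults.conf
// (as on Cloudera) sets spark.serializer to KryoSerializer.
val conf = new SparkConf()
  .setAppName("fpgrowth-java-serializer") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
```

The same setting can also be passed at submit time with --conf spark.serializer=org.apache.spark.serializer.JavaSerializer, which avoids a code change.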