I have data as key/value pairs, where the key is the column index and the value is the content of that column. My original file is just a CSV. So I have the following:

val myData = sc.textFile(file1)
  .map(x => x.split('|'))
  .flatMap(x => x.zipWithIndex)
  .map(x => x.swap)
  .groupByKey().cache

This puts my data into myData: RDD[(Int, Iterable[String])]

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(1)

val model = fpg.run(myData)

I get the following error:

<console>:29: error: inferred type arguments [Nothing,(Int, Iterable[String])] do not conform to method run's type parameter bounds [Item,Basket <: Iterable[Item]]

I am trying to learn how to use MLlib and don't quite understand the issue. I've also tried removing the index with .map(x => x._2) and building sets from just the iterable data, but that also fails.

1 Answer

This should solve your problem:

fpg.run(myData.values.map(_.toArray))

Basically, FPGrowth requires each basket to be an Array of items. Passing the output of groupByKey directly won't work because each element is a Tuple2, and the output of map(x => x._2) won't work because each value is an Iterable, not an Array.
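To see the change of shape without a cluster, here is a minimal sketch that uses plain Scala collections as a stand-in for the RDD. The sample rows and the names rows, grouped and baskets are made up for illustration; only the sequence of steps mirrors the question.

```scala
// Plain-Scala stand-in for the RDD pipeline; no Spark required.
// The three sample rows are hypothetical.
val rows = Seq("a|b|c", "a|d|c", "e|b|c")

// Same steps as in the question: split, index each field, key by column index.
val grouped: Map[Int, Seq[String]] = rows
  .map(x => x.split('|'))
  .flatMap(x => x.zipWithIndex)
  .map(x => x.swap)
  .groupBy(_._1)
  .map { case (idx, pairs) => idx -> pairs.map(_._2) }

// Equivalent of myData.values.map(_.toArray): drop the keys, keep Arrays.
val baskets: Seq[Array[String]] =
  grouped.toSeq.sortBy(_._1).map { case (_, items) => items.toArray }

baskets.foreach(b => println(b.mkString("[", ",", "]")))
```

Note that grouping by column index tends to produce repeated values inside a basket, since the same value can appear in a column on many rows.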

Each element of the RDD represents a single basket and should contain only unique items. If you expect duplicates, you can use _.toSet.toArray or _.toArray.distinct.
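Putting the pieces together, a rough end-to-end sketch might look like this. It assumes sc and file1 exist as in the question, and a Spark version (1.3.1 or later) where run accepts an RDD[Array[Item]] and freqItemsets yields FreqItemset objects. The name baskets is only illustrative.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Assumption: `sc` (SparkContext) and `file1` (input path) are defined as in the question.
val baskets: RDD[Array[String]] = sc.textFile(file1)
  .map(x => x.split('|'))
  .flatMap(x => x.zipWithIndex)
  .map(x => x.swap)
  .groupByKey()
  .values                    // drop the column index, keep the grouped values
  .map(_.toArray.distinct)   // one Array per basket, duplicates removed
  .cache()

val model = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(1)
  .run(baskets)

// Print each frequent itemset with its frequency.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ": " + itemset.freq)
}
```

Removing duplicates before calling run matters here because a column's values usually repeat across rows.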