I have a CSV file with the following format :
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is a integer and the product_titleX is a String, example :
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark for Scala so far and using the tutorials I have found on the official page and the Berkley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it in tuples RDD[Array[String]]
val tuples = file.map(line => line.split(",")).cache
and after I'm transforming the tuples into pairs RDD[(Int, String)]
val pairs = tuples.(line => (line(0),line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks