So far I have been able to tokenize all of my documents and run them through CountVectorizer and IDF from Spark's MLlib. I am trying to get the top 50 words from each document, but I am not sure how to sort the output of IDF.
onePer is a DataFrame of document IDs and tokenized documents.
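val tf = new CountVectorizer()
  .setInputCol("tokens")    // assumed column name; not given in the post
  .setOutputCol("features") // assumed column name; not given in the post
  .fit(onePer).transform(onePer)
  .select("features").rdd
  .map { x: Row => x.getAs[Vector](0) }
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)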
This is what my output looks like; each vector has the form (number of words in the vocabulary, word IDs, word scores). I would like to sort by score and get the top k:
I was able to get this working by doing the following:
=> x.toSparse).map{x =>
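
The snippet above is cut off, so here is a minimal sketch of how the sort might look once each vector is sparse. It is an illustration, not necessarily the exact code that was used. To keep it runnable without Spark, two plain arrays stand in for the `indices` and `values` fields of a sparse vector; the `TopTerms` object, the `topK` helper, and the sample numbers are all made up for this example.

```scala
// Plain-Scala sketch: no Spark dependency, arrays stand in for a sparse vector.
object TopTerms {
  // indices: vocabulary IDs of the non-zero entries of one document
  // values:  the TF-IDF scores at those positions, in the same order
  def topK(indices: Array[Int], values: Array[Double], k: Int): Array[(Int, Double)] =
    indices.zip(values) // pair each word ID with its score
      .sortBy(-_._2)    // sort by score, highest first
      .take(k)          // keep the k best (fewer if the document has fewer terms)

  def main(args: Array[String]): Unit = {
    // Hypothetical document with four non-zero terms
    val indices = Array(0, 3, 7, 9)
    val values = Array(0.5, 2.0, 1.25, 0.1)
    // prints "3 -> 2.0" then "7 -> 1.25"
    topK(indices, values, 2).foreach { case (id, score) => println(s"$id -> $score") }
  }
}
```

On the RDD from the question, the same helper could presumably be applied per document with something like `tfidf.map(_.toSparse).map(v => TopTerms.topK(v.indices, v.values, 50))`. The result contains word IDs rather than words; turning them back into words would require keeping the fitted CountVectorizerModel around and looking the IDs up in its `vocabulary` array.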