I am trying to process each row in a Spark DataFrame and build another DataFrame from it. Essentially, I have a DataFrame A with an "id" column and a column that is an array of sentences. I'd like to convert this to another DataFrame in which each sentence is uniquely identified by a "docID:count" identifier string. My code is:
var sentencesCollection: Seq[SentenceIdentifier] = Seq()
tokenized.foreach(row => {
  val docID = row.getAs[String]("id")
  val sentences = row.getAs[Seq[String]]("sentences")
  var count: Integer = 0
  for (elem <- sentences) {
    val sentenceID: String = docID + ":" + count
    count = count + 1
    val si = SentenceIdentifier(sentenceID, elem)
    sentencesCollection = sentencesCollection :+ si
  }
})
println(sentencesCollection.length)
However, the println statement prints "0".
Any idea how I can have sentencesCollection be the sequence that I can further process downstream? (Possibly through a .toDF() call.)
foreach is executed by the executors (probably on another machine, in another JVM), while your Seq only lives in the driver. This has been asked countless times before. If you want a local copy of your data, use collect. - However, since what you want is to create a new DataFrame, building a local collection would be a waste of memory and a performance bottleneck. There are many methods to create a new DataFrame from another, like select, filter, or map (on Dataset). I think you should read more about how Spark works. - Luis Miguel Mejía Suárez
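One way to follow that advice and stay entirely on the cluster is to explode the sentences array with its element positions and build the identifier column-wise. The sketch below is an assumption about your schema (an "id" string column and a "sentences" array column, as described in the question); the local SparkSession and sample data exist only to make it self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, posexplode}

object SentencesExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration; in your job `tokenized` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("sentences").getOrCreate()
    import spark.implicits._

    // Stand-in for the question's `tokenized` DataFrame.
    val tokenized = Seq(
      ("doc1", Seq("first sentence", "second sentence")),
      ("doc2", Seq("another sentence"))
    ).toDF("id", "sentences")

    // posexplode emits one row per array element together with its position,
    // so the "docID:count" identifier is built as a column transformation
    // executed on the executors, with no driver-side mutable state.
    val sentencesDF = tokenized
      .select(col("id"), posexplode(col("sentences")).as(Seq("pos", "sentence")))
      .select(concat(col("id"), lit(":"), col("pos")).as("sentenceID"), col("sentence"))

    sentencesDF.show(false)
    spark.stop()
  }
}
```

The result is already a DataFrame, so no .toDF() conversion is needed; call collect on it only if you truly need the rows in driver memory.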