I am working in the Scala Spark Shell and have the following RDD:
scala> docsWithFeatures
res10: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[162] at repartition at <console>:9
I previously saved this to text using:
docsWithFeatures.saveAsTextFile("path/to/file")
Here's an example line from the text file (which I've shortened here for readability):
(22246418,(112312,[4,11,14,15,19,...],[109.0,37.0,1.0,3.0,600.0,...]))
Now, I know I could have saved this as an object file to simplify things, but the raw text format is better for my purposes.
My question is: what is the proper way to get this text file back into an RDD of the same format as above (i.e. an RDD of (Long, SparseVector) tuples)? I'm assuming I just need to load it with sc.textFile and then apply a couple of mapping functions, but I'm very new to Scala and not sure how to go about it.
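Here is a rough sketch of what I mean. I'm assuming the vector part of each line matches the format that MLlib's Vectors.parse accepts (it looks like the same format Vector.toString produces), and that splitting at the first comma after stripping the outer parentheses cleanly separates the ID from the vector, but I'm not sure this is the idiomatic way to do it:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val loaded: org.apache.spark.rdd.RDD[(Long, Vector)] =
  sc.textFile("path/to/file").map { line =>
    // Strip the outer parentheses:
    // "(22246418,(112312,[...],[...]))" -> "22246418,(112312,[...],[...])"
    val trimmed = line.substring(1, line.length - 1)
    // Split at the first comma: everything before it is the ID,
    // everything after is the sparse-vector string
    val firstComma = trimmed.indexOf(',')
    val id = trimmed.substring(0, firstComma).toLong
    val vec = Vectors.parse(trimmed.substring(firstComma + 1))
    (id, vec)
  }
```

Is something like this the right approach, or is there a cleaner way?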