After calculating the distance matrix for a set of points stored in a file on HDFS, I need to store the result, which is in distributed form (CoordinateMatrix/RowMatrix), in MongoDB through the MongoDB Connector for Apache Spark. Is there a recommended way to do this, or a better connector for such an operation?
Here is the relevant part of my code:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// Parse each comma-separated line into a dense vector
val data = sc.textFile("hdfs://localhost:54310/usrp/copy_sample_data.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
// Attach a row index to every point: (index, vector)
val indexedData = points.zipWithIndex().map { case (value, index) => (index, value) }
// Build every pair of points
val pairedSamples = indexedData.cartesian(indexedData)
// distance and covariance are helper functions defined elsewhere (not shown)
val entries: RDD[MatrixEntry] = pairedSamples.map { case ((i, x), (j, y)) => MatrixEntry(i, j, covariance(distance(x, y))) }
val coomat: CoordinateMatrix = new CoordinateMatrix(entries)
Note that I built this matrix in Spark from an RDD, so perhaps it would be better, if it is possible, to save the data to MongoDB directly from the RDD?
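To make the RDD route concrete, here is a minimal sketch of what such a save might look like. It assumes the 2.x line of the MongoDB Spark Connector, where `MongoSpark.save` accepts an RDD of `org.bson.Document`, and that `spark.mongodb.output.uri` is already set in the SparkConf. The helper names `toDocument` and `saveMatrix` and the field names `i`, `j` and `value` are hypothetical, chosen only for illustration; newer connector releases may expose only a DataFrame-based writer.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
import org.bson.Document

// Hypothetical helper: one MongoDB document per matrix cell.
// The field names "i", "j" and "value" are illustrative only.
def toDocument(e: MatrixEntry): Document =
  new Document()
    .append("i", Long.box(e.i))
    .append("j", Long.box(e.j))
    .append("value", Double.box(e.value))

// Assumption: connector 2.x, with spark.mongodb.output.uri set in the
// SparkConf so that MongoSpark.save knows the target collection.
def saveMatrix(coomat: CoordinateMatrix): Unit = {
  val docs: RDD[Document] = coomat.entries.map(toDocument)
  MongoSpark.save(docs)
}
```

With the matrix from the snippet above, this would be called as `saveMatrix(coomat)`. Storing one document per entry keeps the write distributed across executors, at the cost of one document per pair of points.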