I have a CSV file with 9000+ records in the following format:

 id,Category1,Category2

How do I convert this CSV file to an RDD<Vector> in Java, so that I can find similar columns with RowMatrix.columnSimilarities in Apache Spark?

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#RowMatrix-org.apache.spark.rdd.RDD-

2 Answers

0
votes

From what I have read, a Vector stores its values as a double[], so the ID and the category values all have to be parsed to doubles. You need to fill one Vector per row.

List<String> lines = Files.readAllLines(Paths.get("myfile.csv"), Charset.defaultCharset());

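Then you can iterate over the lines, parse the values of each line into a double[], and create a Vector from it with Vectors.dense(...). An RDD cannot be appended to element by element, so collect the Vectors in a List and turn that list into an RDD in one step, for example with JavaSparkContext.parallelize(...).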
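The parsing step could look roughly like the sketch below. It uses only the JDK, and the class name CsvRows and the hasHeader flag are made up for illustration. It assumes every field is numeric and that no field contains a quoted comma; a real CSV parser would be safer otherwise.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvRows {
    // Parses CSV lines into one double[] per record.
    // Assumption: all fields are numeric and there are no quoted commas.
    static List<double[]> parse(List<String> lines, boolean hasHeader) {
        List<double[]> rows = new ArrayList<>();
        // Skip the first line when it is a header such as "id,Category1,Category2".
        for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length];
            for (int j = 0; j < fields.length; j++) {
                values[j] = Double.parseDouble(fields[j].trim());
            }
            rows.add(values);
        }
        return rows;
    }
}
```

Each double[] can then be wrapped with Vectors.dense(values) from org.apache.spark.mllib.linalg. Passing the resulting List<Vector> to JavaSparkContext.parallelize(...) and calling .rdd() on the result should give the RDD<Vector> that the RowMatrix constructor expects. Since columnSimilarities compares columns, you may want to leave the id column out of the vectors.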

0
votes

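You can try this (Scala):

// needs: import org.apache.spark.sql.Row and import org.apache.spark.mllib.linalg.Vectors
sparkSession.read.option("header", "true").csv(myCsvFilePath) // DataFrame; columns are read as strings (drop the header option if the file has no header line)
  .rdd // RDD[Row]; mapping on the RDD avoids needing an Encoder for Vector
  .map((r: Row) => Vectors.dense(r.getString(0).toDouble, r.getString(1).toDouble, r.getString(2).toDouble)) // RDD[Vector], assuming all three columns are numeric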

Feel free to reach out if it doesn't work.