
I am trying to read a large matrix of doubles from a tab-separated text file, row by row, using Scala and Apache Spark.

If I do the following:

val obs = sc.textFile("path_to_text_file")

I get obs: org.apache.spark.rdd.RDD[String]

However, I need an RDD of vectors (org.apache.spark.mllib.linalg.Vector), with one vector per row of the file. Could you help?

Thanks and regards,

1
More info on what you have and what you want would probably help. (What is the separator character? Is the data row-wise or column-wise? Should the RDD of vectors of doubles be row-wise or column-wise?) - Gábor Bakos
Thanks a lot, Gábor. I edited the question accordingly... - learning_spark
More specifically, I get the following errors: - learning_spark
[error] .../test/src/main/scala/mult_gaus.scala:22: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[String]
[error]  required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
- learning_spark

1 Answer


Something like this might work for you:

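import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

final val SEPARATOR_AS_REGEX = "\t" // Tab; replace it with your own separator regex if needed
val vectors: RDD[Vector] = obs
  // Split each line into fields and parse every field as a Double
  .map(line => line.split(SEPARATOR_AS_REGEX).map(_.toDouble))
  // Convert each row to the expected type (Vectors.dense returns a Vector)
  .map(ds => Vectors.dense(ds))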
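
The per-line parsing step can be checked without a Spark context. Below is a minimal sketch on plain Scala strings, assuming tab-separated rows where every field is a valid double. `RowParsing` and `parseRow` are hypothetical names introduced here for illustration and are not part of the original answer or of Spark.

```scala
// Hypothetical helper (an assumption, not part of Spark): parses one row of the matrix.
object RowParsing {
  // Splits the line on the separator regex and converts each field to a Double.
  // Throws NumberFormatException if a field is not a valid number.
  def parseRow(line: String, separatorRegex: String = "\t"): Array[Double] =
    line.split(separatorRegex).map(_.trim.toDouble)
}
```

With Spark on the classpath, this would presumably be used as `obs.map(line => Vectors.dense(RowParsing.parseRow(line)))`, which yields an `RDD[org.apache.spark.mllib.linalg.Vector]`. Keeping the parsing in a small function makes it easier to decide later how malformed or empty lines should be handled.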