
I am trying to read a large matrix of doubles from a tab separated text file, row by row. This is in Scala/Apache Spark.

If I do the following:

val obs = sc.textFile("path_to_text_file")

I get obs: org.apache.spark.rdd.RDD[String]

However, the requirement is to have an RDD of vectors. Would you kindly help?

Thanks and regards,

Probably more info on what you have and what you want would help. (What is the separator character? Is the file row-wise or column-wise? Do you want an RDD of vectors of doubles per row or per column?) - Gábor Bakos
Thanks a lot, Gábor. I edited the question accordingly... - learning_spark
More specifically, I get the following errors: - learning_spark
[error] .../test/src/main/scala/mult_gaus.scala:22: type mismatch; [error] found : org.apache.spark.rdd.RDD[String] [error] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] - learning_spark

1 Answer


Something like this might work for you:

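import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val SEPARATOR_AS_REGEX = ";" // Replace it with your separator regex, e.g. "\t" for tabs
val vectors: RDD[Vector] = obs.map { line =>
  // Parse the individual elements of one row
  val ds = line.split(SEPARATOR_AS_REGEX).map(_.toDouble)
  Vectors.dense(ds) // Returns Vector (not DenseVector); RDD is invariant, so this matches RDD[Vector]
}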
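
If it helps, the per-line parsing can be tried without a Spark cluster. The following is only a minimal sketch: the object name `RowParsing`, the helper `parseRow`, and the sample lines are hypothetical and not part of the question. It assumes every field is numeric and that the default separator is a tab, as the question describes.

```scala
object RowParsing {
  // Hypothetical helper: split one line on a separator regex and parse
  // each field as a Double. Throws NumberFormatException on a
  // non-numeric or empty field.
  def parseRow(line: String, separatorRegex: String = "\t"): Array[Double] =
    line.split(separatorRegex).map(_.trim.toDouble)

  def main(args: Array[String]): Unit = {
    // Made-up sample rows standing in for lines of the text file
    val sample = Seq("1.0\t2.5\t-3", "4\t5e-1\t6")
    val rows = sample.map(parseRow(_))
    rows.foreach(r => println(r.mkString(", ")))
  }
}
```

In the Spark job, the same helper could presumably be used as obs.map(line => Vectors.dense(RowParsing.parseRow(line))). A blank line in the file would make the parse throw, so it may be worth filtering such lines out first, for example with obs.filter(_.trim.nonEmpty).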