Scala/Spark - Create Dataset with one column from another Dataset

Question

I am trying to create a Dataset with only one column from a Dataset of a case class.

Below is the code:

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector  // assumed: the Vector type produced by ml.feature.Word2Vec
import org.apache.spark.sql.{Dataset, SparkSession}

case class vectorData(value: Array[String], vectors: Vector)

def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Hello world!")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // ... read the input into `data` etc. (omitted) ...
    val word2vec = new Word2Vec()
      .setInputCol("value").setOutputCol("vectors")
      .setVectorSize(5).setMinCount(0).setWindowSize(5)
    val dataset = spark.createDataset(data)

    // Train the model, then append the "vectors" column to each row
    val model = word2vec.fit(dataset)

    val encoder = org.apache.spark.sql.Encoders.product[vectorData]
    val result = model.transform(dataset).as(encoder)

    //val output: Dataset[Vector]  = ???
}

As the last line of the code shows, I want the output to contain only the second column, vectors, which has type Vector.

I tried:

val output = result.map(o => o.vectors)

But this line is flagged with the error: No implicit arguments of type: Encoder[Vector]

How can I resolve this?

@Prateek result.select("vectors") creates a sql.DataFrame, but not Dataset[Vector]. Any ideas? — xzk

Boris Azanov · Accepted Answer · 2020-10-12T11:09:53

I think adding this line, so that an implicit Encoder[Vector] is in scope:

implicit val vectorEncoder: Encoder[Vector] = org.apache.spark.sql.Encoders.product[Vector]

should make

val output = result.map(o => o.vectors)

compile.
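
One caveat about the encoder line: Encoders.product is declared for types that extend Product (case classes and tuples), and org.apache.spark.ml.linalg.Vector is a plain trait, so that exact line may be rejected by the compiler. The idea of supplying an implicit Encoder[Vector] still holds. The following is a minimal sketch of one way to do that, using Encoders.kryo instead. It assumes Vector is org.apache.spark.ml.linalg.Vector, that Kryo serialization is acceptable for the output, and it uses a small made-up data sequence, because the question does not show how data is read. The object name VectorColumnSketch is hypothetical.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Same case class as in the question.
case class vectorData(value: Array[String], vectors: Vector)

object VectorColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("vector-column-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample input; the question does not show how `data` is read.
    val data: Seq[Array[String]] = Seq(
      Array("spark", "makes", "datasets"),
      Array("word2vec", "makes", "vectors")
    )
    val dataset = spark.createDataset(data) // single column named "value"

    val word2vec = new Word2Vec()
      .setInputCol("value").setOutputCol("vectors")
      .setVectorSize(5).setMinCount(0).setWindowSize(5)
    val model = word2vec.fit(dataset)

    // Typed view of the transformed rows, as in the question.
    val result = model.transform(dataset).as(Encoders.product[vectorData])

    // Assumption: a Kryo-based encoder is good enough here.
    // It satisfies the implicit Encoder[Vector] that map() asks for.
    implicit val vectorEncoder: Encoder[Vector] = Encoders.kryo[Vector]

    // Keep only the second field of each row.
    val output: Dataset[Vector] = result.map(o => o.vectors)
    output.collect().foreach(println)

    spark.stop()
  }
}
```

The trade-off of the Kryo encoder is that the resulting Dataset stores each vector as an opaque binary value, so the column is no longer usable as a vector in SQL expressions or ML stages without mapping it back. If that matters, keeping the DataFrame from result.select("vectors") may be the simpler option.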
