Converting Dataframe Columns to Vectors in Spark

Question

I am new to spark and am trying to use some of the MLlib functions to help me on a school project. All the documentation for how to do analytics with MLlib seems to use vectors and I was wondering if I could just configure what I wanted to do to a data frame instead of a vector in spark.

For example in the documentation for scala for doing PCA is:

"val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA().fit(df)"

etc.... the for that is here: https://spark.apache.org/docs/latest/ml-features.html#pca

Is there a way I dont have to create these vectors and just configure it to the dataframe I have already created. The dataframe I have already created has 50+ columns and 15,000+ rows so making vectors for each column isnt really feasible. Does anyone have any ideas or suggestions. Lastly, unfortunately for my project I am limited to using Spark in Scala I am not allowed to use Pyspark, Java for Spark, or SparkR. If anything was unclear please let me know. Thanks!

ookboy24 ookboy24 · Accepted Answer · 2018-11-09T08:11:04

What you are looking for is the vector assembler transformer which takes an array of data frame columns and produces a single vector column and then you can use an ML pipeline with the assembler and PCA.

Help docs are here

vector assembler: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
ml pipeline: https://spark.apache.org/docs/latest/ml-pipeline.html

If you need more than PCA you can use low-level RDD transformations.

Converting Dataframe Columns to Vectors in Spark

1 Answers