I am new to Spark and am trying to use some of the MLlib functions for a school project. All the documentation for doing analytics with MLlib seems to use vectors, and I was wondering if I could instead configure what I want to do against a DataFrame I already have in Spark.
For example, the Scala documentation for doing PCA shows:
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)
The full example is here: https://spark.apache.org/docs/latest/ml-features.html#pca
Is there a way I don't have to create these vectors by hand and can just point it at the DataFrame I have already created? The DataFrame I have has 50+ columns and 15,000+ rows, so manually building vectors for each column isn't really feasible. Does anyone have any ideas or suggestions? Lastly, for my project I am limited to using Spark in Scala; I am not allowed to use PySpark, Java for Spark, or SparkR. If anything is unclear, please let me know. Thanks!
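For context, I came across VectorAssembler in the ml.feature docs, and here is a small sketch of what I'm hoping would work. The toy data and the column names (col1, col2, col3) are placeholders standing in for my real 50+ column DataFrame, so I'd appreciate confirmation that this is the right approach:

```scala
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PcaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pca-sketch")
      .getOrCreate()
    import spark.implicits._

    // Placeholder for my real DataFrame (mine has 50+ numeric columns).
    val df = Seq(
      (1.0, 2.0, 3.0),
      (4.0, 5.0, 6.0),
      (7.0, 8.0, 10.0)
    ).toDF("col1", "col2", "col3")

    // VectorAssembler packs the chosen columns into one vector column,
    // so no manual Vectors.dense(...) calls per row.
    val assembler = new VectorAssembler()
      .setInputCols(df.columns)   // or a subset, e.g. Array("col1", "col2")
      .setOutputCol("features")
    val assembled = assembler.transform(df)

    // PCA then consumes that assembled column directly.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2)
      .fit(assembled)
    pca.transform(assembled).select("pcaFeatures").show(false)

    spark.stop()
  }
}
```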