I am kind of confused about why Spark MLlib's ETL functions (MinMaxScaler, for example) need vectors to be assembled instead of just operating on the DataFrame column directly. That is, instead of being able to do this:
scaler = MinMaxScaler(inputCol="time_since_live", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)  # fails: "time_since_live" is a plain double column, not a vector
I need to do this:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Wrap the single column in a one-element vector column first
assembler = VectorAssembler(inputCols=["time_since_live"], outputCol="time_since_liveVect")
main_df = assembler.transform(main_df)
scaler = MinMaxScaler(inputCol="time_since_liveVect", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)
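On top of that, to get a plain double column back out afterwards I end up doing something like the following (vector_to_array is from pyspark.ml.functions and needs Spark 3.0+; the output column name here is just one I made up):

from pyspark.ml.functions import vector_to_array

# Unpack the one-element result vector back into a plain double column
# ("scaledTimeSinceLiveScalar" is just a name I picked)
main_df = main_df.withColumn(
    "scaledTimeSinceLiveScalar",
    vector_to_array("scaledTimeSinceLive").getItem(0),
)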
Assembling a vector from a single input column just to run MinMaxScaler on it seems like such an unnecessary step. Why does the scaler need its input in vector format instead of accepting a plain numeric DataFrame column?
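For reference, here is a minimal self-contained version of what I am describing (the toy data is made up); the commented-out direct call is the part that fails for me with a column-type error, since MinMaxScaler requires a VectorUDT input column:

from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3.0,), (7.5,), (12.0,)], ["time_since_live"])

# Direct approach: fails at fit() because the column is a plain double
# MinMaxScaler(inputCol="time_since_live", outputCol="scaled").fit(df)

# Assembler detour: works
assembled = VectorAssembler(
    inputCols=["time_since_live"], outputCol="vect"
).transform(df)
MinMaxScaler(inputCol="vect", outputCol="scaled").fit(assembled).transform(assembled).show()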