
I am kind of confused about why Spark's MLlib ETL functions, MinMaxScaler for example, need vectors to be assembled instead of just using the column from the dataframe. That is, instead of being able to do this:

scaler = MinMaxScaler(inputCol="time_since_live", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)

I need to do this:

assembler = VectorAssembler(inputCols=['time_since_live'], outputCol='time_since_liveVect')
main_df = assembler.transform(main_df)
scaler = MinMaxScaler(inputCol="time_since_liveVect", outputCol="scaledTimeSinceLive")
main_df = scaler.fit(main_df).transform(main_df)

It seems like such an unnecessary step because I end up creating a vector with one input column to run the MinMaxScaler on. Why does it need this to be in vector format instead of just a dataframe column?

Are you asking why a vector is used in general, or why Spark uses it? – Salim

1 Answer


In machine learning and pattern recognition, a set of such features is conventionally represented as a vector, called a "feature vector". See the Wikipedia articles on feature and feature vector for background.

Thus the APIs of all the major ML libraries are built to work with feature vectors.
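
The benefit shows up once there is more than one feature. A minimal sketch (the extra columns num_views and num_likes are made up for illustration): assembling several columns into one vector lets a single MinMaxScaler rescale all of them together:

from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Assemble several feature columns into one vector column
assembler = VectorAssembler(inputCols=['time_since_live', 'num_views', 'num_likes'], outputCol='features')
main_df = assembler.transform(main_df)

# One scaler rescales every dimension of the vector to [0, 1]
scaler = MinMaxScaler(inputCol='features', outputCol='scaledFeatures')
main_df = scaler.fit(main_df).transform(main_df)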

Now the question becomes where the vector-conversion step should live: in the client code (as it is now), or inside the API, so that client code can call it just by listing the feature columns. IMHO we could have both. If you have some time to spare, you could add a new API that accepts a list of columns instead of a feature vector and open a pull request; let's see what the Spark community thinks about it.
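
In the meantime you can hide the extra step on your own side. A rough sketch, with a helper name I made up (scale_min_max), that wraps the assembly and scaling in a Pipeline so callers only pass a list of columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

def scale_min_max(df, cols, output_col='scaledFeatures'):
    # Assemble the listed columns into a temporary vector column,
    # scale it, then drop the temporary column
    assembler = VectorAssembler(inputCols=cols, outputCol='_features')
    scaler = MinMaxScaler(inputCol='_features', outputCol=output_col)
    return Pipeline(stages=[assembler, scaler]).fit(df).transform(df).drop('_features')

main_df = scale_min_max(main_df, ['time_since_live'])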