I am trying to classify the rows of the following DataFrame into 4 classes (the REGIME column), using one-hot encoding in scikit-learn:
          K   T_STAR                 REGIME
15   90.929  0.95524  BoilingInducedBreakup
 9  117.483  0.89386                 Splash
16   97.764  1.17972  BoilingInducedBreakup
13   76.917  0.91399  BoilingInducedBreakup
 6   44.889  0.95725  BoilingInducedBreakup
20  151.662  0.56287                 Splash
12   67.155  1.22842     ReboundWithBreakup
 7  114.747  0.47618                 Splash
17  121.731  0.52956                 Splash
12   29.397  0.88702             Deposition
14   31.733  0.69154             Deposition
13  119.433  0.39422                 Splash
21   97.913  1.21309     ReboundWithBreakup
10  117.544  0.18538                 Splash
27   76.957  0.52879             Deposition
22  155.842  0.17559                 Splash
 3   25.620  0.18680             Deposition
30  151.773  1.23027     ReboundWithBreakup
34   91.146  0.90138             Deposition
19   58.095  0.46110             Deposition
14   85.596  0.97520  BoilingInducedBreakup
41   97.783  0.16985             Deposition
 0   16.683  0.99355             Deposition
28  122.022  1.22977     ReboundWithBreakup
 0   25.570  1.24686     ReboundWithBreakup
 3  113.315  0.48886                 Splash
 7   31.873  1.30497     ReboundWithBreakup
 0  108.488  0.73423                 Splash
 2   25.725  1.29953     ReboundWithBreakup
37   97.695  0.50930             Deposition
Here is the sample as CSV:
,K,T_STAR,REGIME
15,90.929,0.95524,BoilingInducedBreakup
9,117.483,0.89386,Splash
16,97.764,1.17972,BoilingInducedBreakup
13,76.917,0.91399,BoilingInducedBreakup
6,44.889,0.95725,BoilingInducedBreakup
20,151.662,0.56287,Splash
12,67.155,1.22842,ReboundWithBreakup
7,114.747,0.47618,Splash
17,121.731,0.52956,Splash
12,29.397,0.88702,Deposition
14,31.733,0.69154,Deposition
13,119.433,0.39422,Splash
21,97.913,1.21309,ReboundWithBreakup
10,117.544,0.18538,Splash
27,76.957,0.52879,Deposition
22,155.842,0.17559,Splash
3,25.62,0.1868,Deposition
30,151.773,1.23027,ReboundWithBreakup
34,91.146,0.90138,Deposition
19,58.095,0.4611,Deposition
14,85.596,0.9752,BoilingInducedBreakup
41,97.783,0.16985,Deposition
0,16.683,0.99355,Deposition
28,122.022,1.22977,ReboundWithBreakup
0,25.57,1.24686,ReboundWithBreakup
3,113.315,0.48886,Splash
7,31.873,1.30497,ReboundWithBreakup
0,108.488,0.73423,Splash
2,25.725,1.29953,ReboundWithBreakup
37,97.695,0.5093,Deposition
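For completeness, this is roughly how I load the sample into regimes_df (the file name regimes.csv is just a placeholder):

import pandas as pd

# The first, unnamed CSV column is the original DataFrame index
regimes_df = pd.read_csv("regimes.csv", index_col=0)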
The feature vector is two-dimensional, (K, T_STAR), and REGIME holds the categories, which are not ordered in any way.
This is what I have done so far for one-hot encoding and scaling:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

num_attribs = ["K", "T_STAR"]
cat_attribs = ["REGIME"]

# Scale the numeric columns and one-hot encode the categorical column
preproc_pipeline = ColumnTransformer([
    ("num", MinMaxScaler(), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

regimes_df_prepared = preproc_pipeline.fit_transform(regimes_df)
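For context, the transformed array has 6 columns: the 2 scaled numeric features plus one one-hot column per REGIME category. A quick sanity check I added:

# 2 scaled features + 4 one-hot columns = 6 columns per row
print(regimes_df_prepared.shape)  # (n_samples, 6)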
When I print the first few lines of regimes_df_prepared, I get:
array([[0.73836403, 0.19766192, 0.        , 0.        , 0.        , 1.        ],
       [0.43284301, 0.65556065, 1.        , 0.        , 0.        , 0.        ],
       [0.97076007, 0.93419198, 0.        , 0.        , 1.        , 0.        ],
       [0.96996242, 0.34623652, 0.        , 0.        , 0.        , 1.        ],
       [0.10915571, 1.        , 0.        , 0.        , 1.        , 0.        ]])
So the one-hot encoding seems to have worked, but the problem is that the one-hot encoding of REGIME is packed together with the feature vectors in this array.
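To double-check which columns ended up in the array, the transformer can be asked for its output names (get_feature_names_out should be available in scikit-learn >= 1.0; the exact strings below are my guess at the output):

# Names of the columns in regimes_df_prepared; the last four come from REGIME
print(preproc_pipeline.get_feature_names_out())
# e.g. ['num__K' 'num__T_STAR' 'cat__REGIME_BoilingInducedBreakup'
#       'cat__REGIME_Deposition' 'cat__REGIME_ReboundWithBreakup' 'cat__REGIME_Splash']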
If I try to train the model like this:
from sklearn.linear_model import LogisticRegression

# One-vs-rest logistic regression, trained on the prepared array
logreg_ovr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='ovr')
logreg_ovr.fit(regimes_df_prepared, regimes_df["REGIME"])
print("Model training score: %.3f" % logreg_ovr.score(regimes_df_prepared, regimes_df["REGIME"]))
The score is 1.0, which can't be right (overfitting?).
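For what it's worth, the fitted model does see all four categories as targets (classes_ is a standard attribute of fitted scikit-learn classifiers):

# The four REGIME categories, in the (alphabetical) order scikit-learn uses
print(logreg_ovr.classes_)
# e.g. ['BoilingInducedBreakup' 'Deposition' 'ReboundWithBreakup' 'Splash']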
Now I want the model to predict a category for a (K, T_STAR) pair:
logreg_ovr.predict([[40,0.6]])
and I get an error:
ValueError: X has 2 features per sample; expecting 6
As suspected, the model treats the entire row of regimes_df_prepared as the feature vector. How can I avoid this?