Dividing the dataset into training and testing data with labels

Question

I am trying to divide the dataset into training and testing sets. In the code below, df_min_max_scaled is my normalized data and df is my unnormalized data, but I am getting an error:

import numpy as np
train_ind = df.sample(frac=0.65, replace=True)
train = df_min_max_scaled[train_ind,]
test = df_min_max_scaled[-train_ind,]
train_labels = df[train_ind, 12] 
test_labels = df[-train_ind, 12]
#train_labels

Error:

TypeError                                 Traceback (most recent call last)
<ipython-input-50-a640d18b42fc> in <module>
      1 import numpy as np
      2 train_ind = df.sample(frac=0.65, replace=True)
----> 3 train = df_min_max_scaled[train_ind,]
      4 test = df_min_max_scaled[-train_ind,]
      5 train_labels = df[train_ind, 12]

It shows the error on line 3. I am actually converting this R code to Python using pandas:

train_ind = sample(nrow(wine), floor(0.65 * nrow(wine)))
train = wine2[train_ind,]
test = wine2[-train_ind,]
train_labels = wine[train_ind, 12] 
test_labels = wine[-train_ind, 12]

That is not how you subset data in Python. Which module/package are you using in Python: pandas, vaex, spark, etc.? It seems you are an R person. You should either stick to R or learn the Python basics. For example, you cannot write R code in C++ and expect it to run; you have to learn the syntax of that language. — onyambu
train_ind in pandas is already the train dataset (df.sample returns rows, not indices). You probably need train_ind = np.random.choice(wine.shape[0], int(0.65 * wine.shape[0]), replace=False) — onyambu
Thank you for the help. Do train and test then remain the same? — AHF

Nerxis · Accepted Answer · 2021-06-07T12:34:16

I would recommend using train_test_split from sklearn. This could involve the following steps:

Load your data (e.g. df = pd.read_csv(...) if your data comes from CSV files).
Split it using train_test_split (from sklearn.model_selection import train_test_split), where df holds your inputs and labels holds the true targets (you can set test_size to any value you want):

train, test, train_labels, test_labels = train_test_split(df, labels, test_size=0.35)

Apply your scaling (I recommend scaling after splitting, so that you fit e.g. the min-max scaler on the training data only, not on both sets).
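The steps above can be sketched as follows. This is only an illustration under stated assumptions: the data frame is synthetic, and the column names (f0 to f10, quality) are hypothetical stand-ins for the wine data, with the label assumed to be the 12th column as in the R code. The random_state values are added only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the wine data: 11 feature columns plus a label
# column ("quality") in the 12th position. Replace with pd.read_csv(...).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 11)), columns=[f"f{i}" for i in range(11)])
df["quality"] = rng.integers(3, 9, size=100)

# Separate inputs from targets before splitting.
features = df.drop(columns="quality")
labels = df["quality"]

# 65% train / 35% test; rows and labels stay aligned.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.35, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)
```

Because the scaler only sees the training rows, values in test_scaled may fall slightly outside [0, 1]; that is expected and avoids leaking test-set statistics into training.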

If you really insist on using the sample function from pandas, you could do the following:

Get the train data: train = df.sample(frac=0.65)
Get the test data (the remaining 35% of the rows): test = df.drop(train.index)
Extract the labels: train_labels = train.iloc[:, 12] (if I understand correctly that 12 is the position of the label among the dataframe columns; note that iloc is 0-based, so the 12th column from the R code would be train.iloc[:, 11])
Finally, access the matching scaled rows via df_min_max_scaled.loc[train.index]

Just ensure that the scaled and unscaled data share the same index.
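A minimal sketch of the pandas-only variant, mirroring the variable names from the question. The data is synthetic and the column names are hypothetical; the label is assumed to be the 12th column (0-based position 11), and random_state is added only for reproducibility. Note that sample is called without replace=True, which matches R's sample (sampling without replacement by default).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data: 11 features plus a label column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 11)), columns=[f"f{i}" for i in range(11)])
df["quality"] = rng.integers(3, 9, size=100)

# Min-max scaled copy that shares df's index, as in the question.
df_min_max_scaled = (df - df.min()) / (df.max() - df.min())

# Sample 65% of the rows without replacement; the rest become the test rows.
train_rows = df.sample(frac=0.65, random_state=0)
test_rows = df.drop(train_rows.index)

# Select the matching scaled rows by index label.
train = df_min_max_scaled.loc[train_rows.index]
test = df_min_max_scaled.loc[test_rows.index]

# Labels come from the unscaled data: all rows, 12th column (position 11).
train_labels = train_rows.iloc[:, 11]
test_labels = test_rows.iloc[:, 11]
```

This reproduces the structure of the R snippet, but it scales the full dataset before splitting; fitting the scaler on the training rows only, as recommended earlier in this answer, is usually the safer choice.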
