I can't understand what I am doing wrong with my training and my test data for a competition on Kaggle

Question

In the machine learning kaggle micro-courses you can find these datasets and code to help you making a prediction model for a competition: https://www.kaggle.com/ [put your user name here] /exercise-categorical-variables/edit

it gives you two datasets: 1 training dataset and 1 test dataset, which you will use to make your prediction and submit to see your ranking in the competition

So in:

Step 5: Generate test predictions and submit your results

I wrote this code: EDITED

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

#print(X_test.shape, X.shape)

X_test.head()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

X_train.head()

#Asses Viability of method ONE-HOT
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])


# Columns that will be one-hot encoded   ####<<<<I THINK THAT THE PROBLEM STARTS HERE>>>>#####
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

##############For X_train

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

##############For X_test
low_cardinality_cols = [col for col in object_cols if X_test[col].nunique() < 10]

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
#Se não retirar os NAs a linha abaixo dá erro
X_test.dropna(axis = 0, inplace=True)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))
#print(OH_cols_test.shape, OH_cols_train.shape)

# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index


# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)


# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

#print(OH_X_test.shape ,OH_X_valid.shape)

# Define and fit model
model = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
model.fit(OH_X_train, y_train)

# Get validation predictions and MAE
preds_test = model.predict(OH_X_test)


# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

When I try to preprocess the datasets, I get different rows for training data and test data. Then I cant fit de model and make the prediction.

I think that I should split only the test dataset to make all that, but y has 1 more row than X_test and then I can't make the split.

So I thought that I had to use the training dataset to split and then fit it for the prediction of the test dataset

Could you fix the link to the competition? It currently returns a 404 error. — Robert Young
I changed the link format. kaggle.com`` [put your user name here]`` /exercise-categorical-variables/edit You have to log in in your kaggle account, copy your username and then paste it on [put your uset name here] in the link I provided — Apolo Reis

Robert Young Robert Young · Accepted Answer · 2020-08-12T08:00:13

I believe your problem is happening in this line:

low_cardinality_cols = [col for col in object_cols if X_test[col].nunique() < 10]

You are referencing X_test for your unique columns. Following the Kaggle tutorial you're supposed to be referencing X_train, like so:

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

You also seem to make this same mistake further down in this line:

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))

You have labelled it as the one-hot encoded training columns, but you've used X_test instead of X_train. You're mixing up your training and testing set processing which is not a good idea. This line should be:

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))

I'd advise you go over the code blocks in the tutorial again and make sure all your datasets and variables match up correctly, so that you're processing the correct training and testing data.

I can't understand what I am doing wrong with my training and my test data for a competition on Kaggle

1 Answers