SKLearn Predicting using new Data

Question

I've tried out linear regression using sklearn. I have data along the lines of:

Calories Eaten | Weight
150 | 150
300 | 190
350 | 200

These are made-up numbers, but I've fit the dataset to a linear regression model.

What I'm confused about is how to predict with new data. Say I get 10 new Calories Eaten values and I want the model to predict Weight for them.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)  # is this the right call for new data?

But how would I make just my 10 new Calories Eaten values the test set that I want the regressor to predict on?

Your question isn't clear. Are you confused about generating the numbers or about building the array? — A.B
Nope. I'm confused about how to have my model predict Weight from 10 new Calories Eaten values. — GuiltyVeek

David David · Accepted Answer · 2018-05-04T18:47:53

You are correct: you simply call your model's predict method and pass in the new, unseen data. How you proceed depends on what you mean by new data. Is it data whose outcome you do not know (i.e. you do not know the weight values), or data you are using to test the performance of your model?

For new data (to predict on):

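Your approach is correct. Pass the new Calories Eaten values to predict as a 2D array of shape (n_samples, n_features), here (10, 1). The returned y_pred holds one predicted weight per row, and you can inspect all predictions by printing it.

If you know the respective weight values and want to evaluate the model: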
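As a minimal sketch of predicting on new data: the three training rows come from the question, while the ten new calorie values and the names new_calories and predicted_weight are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data from the question (made-up numbers).
# reshape(-1, 1) turns the 1D list into a (3, 1) feature matrix.
calories_train = np.array([150, 300, 350]).reshape(-1, 1)
weight_train = np.array([150, 190, 200])

regressor = LinearRegression()
regressor.fit(calories_train, weight_train)

# Ten hypothetical new Calories Eaten values with unknown weights.
new_calories = np.array(
    [100, 175, 200, 225, 250, 275, 325, 375, 400, 450]
).reshape(-1, 1)

# predict returns one value per input row.
predicted_weight = regressor.predict(new_calories)

for calories, weight in zip(new_calories.ravel(), predicted_weight):
    print(f"{calories} calories -> predicted weight {weight:.1f}")
```

The new values do not need to come from train_test_split. Any array with the same number of columns as the training features can be passed to predict.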

Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then measure performance with a regression metric. A common choice is the root mean squared error (RMSE): pass y_test and y_pred to sklearn's mean_squared_error and take the square root of the result. Here is a list of all the regression performance metrics supplied by sklearn.
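A small sketch of the RMSE calculation, assuming you already have the true labels and the predictions. The numbers in y_test and y_pred below are hypothetical stand-ins, not results from a real model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical held-out labels and the model's predictions for them.
y_test = np.array([160.0, 185.0, 210.0])
y_pred = np.array([158.0, 190.0, 205.0])

# mean_squared_error takes the true values first, then the predictions.
mse = mean_squared_error(y_test, y_pred)

# RMSE is the square root of MSE, in the same units as the target.
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
```

Taking the square root explicitly avoids relying on version-specific options of mean_squared_error.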

If you do not know the weight value of the 10 new data points:

Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.

from sklearn.model_selection import train_test_split
# random_state can be any number (it makes the split reproducible); test_size=0.25 holds out 25% for testing
# note the return order: both feature sets first, then both label sets
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size=0.25, random_state=42)

Train the model by fitting it on x_train and y_train. Then evaluate how well it generalizes by predicting on x_test and comparing these predictions with the actual values in y_test. This gives you an idea of how the model performs on data it has not seen. You can then predict the weight values for the 10 new data points in the same way, by passing them to predict.
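Putting the steps together, here is a rough end-to-end sketch. The data set is synthetic (40 random calorie values with an assumed noisy linear relation to weight), and new_calories is a hypothetical batch of 10 unlabeled values. Note that train_test_split returns its outputs in the order x_train, x_test, y_train, y_test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: an assumed linear relation plus noise.
rng = np.random.default_rng(0)
calories_eaten = rng.uniform(100, 500, size=40).reshape(-1, 1)
weight = 110 + 0.25 * calories_eaten.ravel() + rng.normal(0, 5, size=40)

# Hold out 25% of the labeled data for evaluation.
x_train, x_test, y_train, y_test = train_test_split(
    calories_eaten, weight, test_size=0.25, random_state=42
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Evaluate on the held-out split, where the true weights are known.
y_pred = regressor.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.2f}")

# Predict for 10 new calorie values whose weights are unknown.
new_calories = np.linspace(120, 480, 10).reshape(-1, 1)
new_weight = regressor.predict(new_calories)
print(new_weight)
```

The test split only tells you how far off the predictions tend to be. The 10 new values cannot be scored because their true weights are not available.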

It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
