Can't plot linear regression predicted model against pandas dataframe

Question

I am trying to plot the predictions of a linear regression model against a pandas DataFrame built from the World Bank API. I would like to feed in the independent variables and predict GDP growth against the date, more like a forecast, but I am really struggling. In addition, the accuracy score is 1, which seems strange, as that would surely mean a perfect prediction? Here is what I have come up with so far:

#Connect to world bank api
!pip install wbdata

#Load libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#Load indicator data
indicators = {"NY.GDP.MKTP.CD": "GDP",
              "NE.CON.PRVT.ZS": "Households and NPISHs Final consumption expenditure (% of GDP)",
              "BX.KLT.DINV.WD.GD.ZS": "Foreign direct investment, net inflows (% of GDP)",
              "NE.CON.GOVT.ZS": "General government final consumption expenditure (% of GDP)",
              "NE.EXP.GNFS.ZS": "Exports of goods and services (% of GDP)",
              "NE.IMP.GNFS.ZS": "Imports of goods and services (% of GDP)" }

#Create dataframe
data = wbdata.get_dataframe(indicators, 
                            country=('GBR'), 
                            data_date=data_dates, 
                            convert_date=False, keep_levels=True)

#Round columns to 2dp
data1 = np.round(data, decimals=2)

#Convert datatype
data1['GDP'] = data1.GDP.astype(float)

#Format digits
data1['GDP'] = data1['GDP'].apply(lambda x: '{:.2f}'.format(x))

#Reset dataframe indexes
data1.reset_index(inplace=True) 

#Drop unused columns
data1.drop(data1.columns[[0]], axis=1, inplace=True)

#Converts all columns in dataframe to float datatypes
data1=data1.astype(float)

#data1.head(11)

#Dependent variable
Y = data1['GDP']

#Independent variable
X = data1[data1.columns[[1,2,3,4,5]]]

#Converts all columns in dataframe to float datatypes
data1=data1.astype(float)

#Create testing and training variables
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)

#Fit linear model
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

#Plot model
plt.scatter(y_test, predictions)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()

#Print accuracy scores
accuracy = model.score(X_test, y_test)
print("Accuracy: ", accuracy)

What is data_dates? Your code has a few other errors that make it difficult to run. — xyzjayne
Your test set is too small with only five data points, so it's not incredibly difficult to score an accuracy of 1. — xyzjayne
data_dates is the year value coming from the World Bank API. The code runs fine for me? — Ryan
I would like to plot it on data1.plot.line(x='date', y='GDP') so that the predicted values appear alongside the actual values. — Ryan
data_date=data_dates gives me an undefined error. Please check. — xyzjayne

xyzjayne · Accepted Answer · 2018-07-19T15:06:05

I ran the code and found several issues.

OP wanted to plot the predicted y values against the dates in X_test.

Because of this line: X = data1[data1.columns[[1,2,3,4,5]]]

X_test no longer contains date (column 0). Run train_test_split(X, Y, test_size=0.1) with X containing date so that each data point keeps its date, then fit and predict with copies of X_train and X_test that have this column dropped (because date is not an independent variable).
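A minimal sketch of that split, assuming a made-up frame in place of the World Bank data (the column names exports and imports and all numbers are invented for illustration, not taken from the question):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data1 (assumption: invented values, not World Bank data)
rng = np.random.default_rng(0)
years = np.arange(1990, 2018, dtype=float)
data1 = pd.DataFrame({
    "date": years,
    "exports": rng.normal(27.0, 2.0, size=years.size),
    "imports": rng.normal(29.0, 2.0, size=years.size),
})
data1["GDP"] = (50.0 * data1["exports"] - 20.0 * data1["imports"]
                + rng.normal(0.0, 5.0, size=years.size))

# Keep 'date' in X so every test row carries its year through the split
X = data1.drop(columns=["GDP"])
Y = data1["GDP"]
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Fit and predict on copies without 'date', since it is not a predictor here
lm = LinearRegression()
lm.fit(X_train.drop(columns=["date"]), y_train)
predictions = lm.predict(X_test.drop(columns=["date"]))

# Pair each prediction with its year, sorted so it can be plotted against date
result = pd.DataFrame({
    "date": X_test["date"].to_numpy(),
    "predicted_GDP": predictions,
}).sort_values("date")
print(result)
```

The rows of X_test stay aligned with predictions, so X_test['date'] can be used directly as the x axis of a scatter plot.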

The high accuracy is due to the dependent variable being included among the independent variables.

X = data1[data1.columns[[1,2,3,4,5]]] actually contains 'GDP' and omits another possible independent variable. The safer approach is to drop 'GDP' from the features explicitly by name rather than selecting columns by position.
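The effect can be reproduced on invented data. The sketch below is an illustration under assumptions (the columns exports and imports, the helper r2, and all numbers are hypothetical): it fits the same model once with 'GDP' left among the features and once with it dropped by name. Note that score on a LinearRegression returns the coefficient of determination R², not a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: invented values, not World Bank data)
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "exports": rng.normal(27.0, 2.0, size=n),
    "imports": rng.normal(29.0, 2.0, size=n),
})
df["GDP"] = (50.0 * df["exports"] - 20.0 * df["imports"]
             + rng.normal(0.0, 30.0, size=n))

Y = df["GDP"]
X_leaky = df[["exports", "imports", "GDP"]]  # target leaked into the features
X_clean = df.drop(columns=["GDP"])           # target dropped explicitly by name


def r2(X):
    """Hypothetical helper: fit on a train split, return R^2 on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.25, random_state=0)
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)


leaky_score = r2(X_leaky)
clean_score = r2(X_clean)
print("R^2 with GDP in the features:", leaky_score)
print("R^2 with GDP dropped:", clean_score)
```

With the target among the features, the model can simply copy it, so the score is 1 up to floating-point error regardless of how informative the other columns are.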

Plotting a line chart with Pandas and a scatter plot in the same graph

OP wanted a line plot of actual GDP against year: data1.plot.line(x='date', y='GDP'), with a scatter plot of the predictions on top: plt.scatter(X_test['date'], predictions). To do this, create a figure and an axes object with plt.subplots() and draw both plots on the same axes. This requires X_test to still contain the date column.

f, ax = plt.subplots()
data1.plot.line(x='date', y='GDP', ax = ax)
ax.scatter(X_test['date'], predictions)
plt.show()
