I am trying to fit a predictive linear regression model to a pandas data frame built from the World Bank API and plot the result. I would like to feed in the independent variables and predict GDP growth against the date, more as a forecast, but I am really struggling. In addition, the accuracy score is 1, which seems strange, as that would surely mean a perfect prediction? Here is what I have come up with so far:

#Connect to world bank api
!pip install wbdata

#Load libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wbdata
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#Load indicator data
indicators = {"NY.GDP.MKTP.CD": "GDP",
              "NE.CON.PRVT.ZS": "Households and NPISHs Final consumption expenditure (% of GDP)",
              "BX.KLT.DINV.WD.GD.ZS": "Foreign direct investment, net inflows (% of GDP)",
              "NE.CON.GOVT.ZS": "General government final consumption expenditure (% of GDP)",
              "NE.EXP.GNFS.ZS": "Exports of goods and services (% of GDP)",
              "NE.IMP.GNFS.ZS": "Imports of goods and services (% of GDP)" }

#Create dataframe
data = wbdata.get_dataframe(indicators, 
                            country=('GBR'), 
                            data_date=data_dates, 
                            convert_date=False, keep_levels=True)

#Round columns to 2dp
data1 = np.round(data, decimals=2)

#Convert datatype
data1['GDP'] = data1.GDP.astype(float)

#Format digits
data1['GDP'] = data1['GDP'].apply(lambda x: '{:.2f}'.format(x))

#Reset dataframe indexes
data1.reset_index(inplace=True) 

#Drop unused columns
data1.drop(data1.columns[[0]], axis=1, inplace=True)

#Converts all columns in dataframe to float datatypes
data1=data1.astype(float)

#data1.head(11)

#Dependent variable
Y = data1['GDP']

#Independent variable
X = data1[data1.columns[[1,2,3,4,5]]]

#Converts all columns in dataframe to float datatypes
data1=data1.astype(float)

#Create testing and training variables
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)

#Fit linear model (use one name, lm, for the estimator throughout)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

#Plot model
plt.scatter(y_test, predictions)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()

#Print accuracy scores
accuracy = model.score(X_test, y_test)
print("Accuracy: ", accuracy)
What is data_dates? Your code has a few other errors that make it difficult to run. - xyzjayne
Your test set is too small with only five data points, so it's not incredibly difficult to score an accuracy of 1. - xyzjayne
The data_dates is the year value coming from the World Bank API. The code runs fine for me? - Ryan
I would like to plot it on the following, data1.plot.line(x='date', y='GDP'), so the predicted values will be shown with the actual values? - Ryan
data_date=data_dates gives me an undefined error. Please check. - xyzjayne

1 Answer

The code was run and multiple issues were identified.

  1. OP wanted to plot predicted y values against the date of X_test.

As a result of this line: X = data1[data1.columns[[1,2,3,4,5]]]

X_test does not contain date (column 0) anymore. Run train_test_split(X, Y, test_size=0.1) with X containing date so that each data point keeps its correct date, then fit and predict on copies of X_train and X_test with this column dropped (because date is not an independent variable).
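A minimal sketch of that split, using a small synthetic frame in place of the World Bank data. The column names consumption and exports, the generated values, and random_state=0 are assumptions made for illustration, not OP's actual data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data1: 'date' plus two made-up indicator columns
rng = np.random.default_rng(0)
years = np.arange(1990, 2020)
data1 = pd.DataFrame({
    "date": years.astype(float),
    "consumption": rng.normal(60, 2, size=years.size),
    "exports": rng.normal(27, 3, size=years.size),
})
data1["GDP"] = (1e12 + 2e10 * data1["consumption"] + 1e10 * data1["exports"]
                + rng.normal(0, 1e10, size=years.size))

# Keep 'date' in X so every test row still carries its year
X = data1.drop(columns=["GDP"])
Y = data1["GDP"]
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0)

# Fit and predict on copies without 'date', since it is not a predictor
lm = LinearRegression()
lm.fit(X_train.drop(columns=["date"]), y_train)
predictions = lm.predict(X_test.drop(columns=["date"]))

# X_test['date'] lines up with predictions row by row
result = pd.DataFrame({"date": X_test["date"].to_numpy(),
                       "predicted_GDP": predictions})
print(result)
```

Because date stays in X_test, it can later be used directly as the x values of a plot.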

  2. The high accuracy is due to the inclusion of the dependent variable in the independent variables.

X = data1[data1.columns[[1,2,3,4,5]]] actually contains 'GDP' and omits another possible independent variable. Selecting columns by position makes this easy to miss; the recommended way is to drop 'GDP' from the data explicitly by name.
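To illustrate why this produces a score of 1: LinearRegression.score returns R^2, and a model that is handed the target as one of its features can reproduce it exactly. The sketch below uses synthetic data (the column names and values are made up for illustration) and compares a leaky feature set with one where 'GDP' is dropped by name:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: GDP depends on two made-up indicators plus noise
rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "consumption": rng.normal(60, 2, size=n),
    "exports": rng.normal(27, 3, size=n),
})
df["GDP"] = (0.3 * df["consumption"] + 0.2 * df["exports"]
             + rng.normal(0, 1, size=n))
y = df["GDP"]

def r2_on_test(features):
    """Fit on a train split and return R^2 on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.25, random_state=0)
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Leaky: the target itself is among the features
leaky_score = r2_on_test(df[["GDP", "consumption"]])

# Clean: drop the target explicitly by name
clean_score = r2_on_test(df.drop(columns=["GDP"]))

print("R^2 with GDP in X:   ", leaky_score)
print("R^2 with GDP dropped:", clean_score)
```

The leaky model simply learns a coefficient of 1 on the GDP column, so its score says nothing about how well the other indicators predict GDP.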

  3. Plotting a line chart with pandas and a scatter plot in the same graph.

OP wanted a line plot of actual GDP against year, data1.plot.line(x='date', y='GDP'), and on top of it a scatter plot of the predictions, plt.scatter(X_test['date'], predictions). To do this, create a figure and axes object with plt.subplots() and draw both plots on the same axes.

f, ax = plt.subplots()
data1.plot.line(x='date', y='GDP', ax = ax)
ax.scatter(X_test['date'], predictions)
plt.show()