I'm trying to apply a Gaussian Naive Bayes model to a dataset to predict disease. It runs correctly when I predict on the training data, but when I predict on the testing data it raises a ValueError.

runfile('D:/ROFI/ML/Heart Disease/prediction.py', wdir='D:/ROFI/ML/Heart Disease')
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('D:/ROFI/ML/Heart Disease/prediction.py', wdir='D:/ROFI/ML/Heart Disease')

  File "C:\Users\User\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\User\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/ROFI/ML/Heart Disease/prediction.py", line 85, in <module>
    predict(x_train, y_train, x_test, y_test)

  File "D:/ROFI/ML/Heart Disease/prediction.py", line 73, in predict
    predicted_data = model.predict(x_test)

  File "C:\Users\User\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 65, in predict
    jll = self._joint_log_likelihood(X)

  File "C:\Users\User\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 429, in _joint_log_likelihood
    n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /

ValueError: operands could not be broadcast together with shapes (294,14) (15,)

What's wrong here?

import pandas
from sklearn import metrics
from sklearn.preprocessing import Imputer
from sklearn.naive_bayes import GaussianNB    

def load_data(feature_columns, predicted_column):

    train_data_frame = pandas.read_excel("training_data.xlsx")
    test_data_frame = pandas.read_excel("testing_data.xlsx")
    data_frame = pandas.read_excel("data_set.xlsx")

    x_train = train_data_frame[feature_columns].values
    y_train = train_data_frame[predicted_column].values

    x_test = test_data_frame[feature_columns].values
    y_test = test_data_frame[predicted_column].values

    x_train, x_test = impute(x_train, x_test)

    return x_train, y_train, x_test, y_test


def impute(x_train, x_test):

    fill_missing = Imputer(missing_values=-9, strategy="mean", axis=0)

    x_train = fill_missing.fit_transform(x_train)
    x_test = fill_missing.fit_transform(x_test)

    return x_train, x_test


def predict(x_train, y_train, x_test, y_test):

    model = GaussianNB()
    model.fit(x_train, y_train.ravel())

    predicted_data = model.predict(x_test)
    accuracy = metrics.accuracy_score(y_test, predicted_data)
    print("Accuracy of our naive bayes model is : %.2f"%(accuracy * 100))

    return predicted_data


feature_columns = ["age", "sex", "chol", "cigs", "years", "fbs", "trestbps", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
predicted_column = ["cp"]

x_train, y_train, x_test, y_test = load_data(feature_columns, predicted_column)

predict(x_train, y_train, x_test, y_test)

N.B.: Both files have the same number of columns.

Can you post the full stack trace? - EFT
@EFT I've posted the full traceback. By the way, I just found out that Imputer is deleting one column because it is composed entirely of missing values. Is there any way to prevent this? - MD. Khairul Basar
No one here has the file you're using with read_excel("training_data.xlsx"). Can you reproduce this issue with a public dataset? - Max Power
Deleting a column certainly seems like the sort of thing that would create the mismatch you're seeing. You could try filling the column in beforehand, or adding it back afterward. The documentation (scikit-learn.org/stable/modules/generated/…) doesn't seem to give a way to keep an empty column. I suppose you could at least get it to raise an error when this happens by using Imputer(..., axis=1) on the transpose of the array you currently feed it. - EFT
@MD.KhairulBasar That's good and makes sense. Since you located that before anyone else posted about it, it might be good for you to add it as an answer and accept it when you can. That way, anyone coming across this error in the future can see before clicking that a solution was found, and then find out what it was without digging through the comments. - EFT

1 Answer

I found the bug. The error is caused by Imputer. Imputer replaces the missing values in a data set, but if a column consists entirely of missing values, it drops that column instead. One column in my testing data set was entirely missing, so Imputer dropped it. The test array then had 14 columns while the model had been trained on 15, which is exactly the shape mismatch the error reports. I removed that column's name from the feature_columns list and it worked.
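
An alternative to dropping the feature is to learn the fill values from the training data only and reuse them on the testing data. In scikit-learn terms that means calling fit_transform on x_train and only transform on x_test, instead of fit_transform on both. This works as long as the column has at least some real values in the training set. The sketch below shows the idea with plain NumPy rather than Imputer. The helper names fit_column_means and fill_missing are made up for illustration, and the small arrays are toy data, not the heart disease files.

```python
import numpy as np

MISSING = -9  # sentinel used for missing values, as in the question


def fit_column_means(x_train, missing=MISSING):
    """Learn one mean per column from the training data, ignoring the sentinel."""
    x = np.asarray(x_train, dtype=float)
    is_missing = x == missing
    counts = (~is_missing).sum(axis=0)             # observed values per column
    sums = np.where(is_missing, 0.0, x).sum(axis=0)
    # A column with no observed values has no mean; this sketch falls back
    # to 0.0 for it (an arbitrary choice, not what Imputer does).
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)


def fill_missing(x, column_means, missing=MISSING):
    """Replace the sentinel with the means learned from the training data."""
    x = np.asarray(x, dtype=float)
    return np.where(x == missing, column_means, x)


# Toy data: the middle column is entirely missing in the test set.
x_train = np.array([[63.0, 1.0, 233.0],
                    [41.0, -9.0, 204.0],
                    [57.0, 0.0, -9.0]])
x_test = np.array([[50.0, -9.0, 210.0],
                   [60.0, -9.0, -9.0]])

means = fit_column_means(x_train)                  # learned from training data only
x_train_filled = fill_missing(x_train, means)
x_test_filled = fill_missing(x_test, means)

print(x_test_filled.shape)  # (2, 3): no column is dropped
```

Because the statistics come from the training set, the testing array always keeps the same number of columns the model was trained on, even when a test column is completely empty.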