
I am starting to write my machine learning model. I have a Y_train dataset containing the labels, with 5 classes, and an X_train dataset containing the samples. I am trying to build the model with logistic regression.

X_train (shape (560, 20531)) and Y_train (shape (560, 5)) have the same number of samples.

I have seen a few posts about the same problem, but I have not been able to solve it. I don't know how to correct this error. Can you help me please?

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

X = pd.read_csv('/Users/lottie/desktop/data.csv', header=None, skiprows=[0])
Y = pd.read_csv('/Users/lottie/desktop/labels.csv', header=None)

# map the class names to integers, then one-hot encode them
Y_encoded = list()
for i in Y.loc[0:, 1]:
    if i == 'BRCA': Y_encoded.append(0)
    if i == 'KIRC': Y_encoded.append(1)
    if i == 'COAD': Y_encoded.append(2)
    if i == 'LUAD': Y_encoded.append(3)
    if i == 'PRAD': Y_encoded.append(4)
Y_bis = to_categorical(Y_encoded)


#separation of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_bis, test_size=0.30, random_state=42)

regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')

# drop the first column (presumably a sample identifier)
X_train = X_train.iloc[:, 1:]

#train model
train_train = regression_log.fit(X_train, Y_train)
What exactly is your X_train? At first glance it looks like you're inverting your number of samples with your number of features. Try X.shape and Y.shape and tell me what the console gives. - sboomi
Each line of X_train is a sample and contains the values for each feature. Y_train contains the class associated with each sample. X.shape: (801, 20532) and Y.shape: (801, 2). - lmj
In all fairness, X and Y having the same number of lines is consistent. X looks weird, though. How come you have 20532 features? - sboomi

1 Answer


You get that error because your labels are one-hot encoded (your Y_train is a 2-D array of shape (560, 5)), while scikit-learn's LogisticRegression expects a 1-D array of class indices. Use a LabelEncoder to encode the labels as 0, 1, 2, ... instead; check out the help page from scikit-learn. Below is an implementation using an example dataset similar to yours:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# toy data shaped like yours: 560 samples, 5 classes
Y = pd.DataFrame({'label': np.random.choice(['BRCA', 'KIRC', 'COAD', 'LUAD', 'PRAD'], 560)})
X = pd.DataFrame(np.random.normal(0, 1, (560, 5)))

# encode the string labels as integers 0..4 (a 1-D array, which is what fit() expects)
Y_encoded = le.fit_transform(Y['label'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.30, random_state=42)

regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')

# kept from your code; only needed if the first column of your real data is an identifier
X_train = X_train.iloc[:, 1:]

train_train = regression_log.fit(X_train, Y_train)
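
As a quick usage sketch on the same toy data (the data is random, so the accuracy itself is meaningless), you can score the fitted model on the held-out split and map the integer predictions back to the original class names with the fitted LabelEncoder:

X_test = X_test.iloc[:, 1:]              # apply the same column slice used for X_train
Y_pred = regression_log.predict(X_test)  # integer class predictions (0..4)

print('test accuracy:', regression_log.score(X_test, Y_test))
print('first predictions:', le.inverse_transform(Y_pred[:5]))  # back to 'BRCA', 'KIRC', ...

On your real data the same pattern applies: fit the LabelEncoder once and reuse it to decode whatever the model predicts.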