Newbie to data science here.

I have a high-dimensional dataset: 83 samples with 2308 dimensions, so its shape is (83, 2308). In addition, I have an array of sample types of length 83, with shape (83,).

I'm trying to train a KNN classifier (2 neighbors) on a subset of my original dataset and use it to predict the sample type of the remaining data points (the test subset). My training data has the shape (63, 2308), and I'm fitting it against a sample types array of shape (63,).

My goal is to train my KNN classifier with a training set that is reduced in dimensionality, so I've run PCA on it. I've kept only the first 10 PCs. After transforming my training set, its shape is (63, 10).

Unfortunately, now I'm unable to use this reduced training set to make predictions on my unreduced testing set. Running my code gives me the error: "query data dimension must match training data dimension".

I'd like to be able to incorporate the first 10 PCs into my KNN model. Any help on making this happen?

Here's my code for reference:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# creates my training and testing partitions
train_ind, test_ind = test_train_id(cancer_types, 0.8)

# create the train partition
genes_train = genes[train_ind, :]

# perform PCA on the train partition
gene_pca = PCA(10)
gene_pca.fit(genes_train)

# transform the gene partition with the PCA
genes_train_red = gene_pca.transform(genes_train) 

# the KNN model
model = KNeighborsClassifier(2)
model.fit(genes_train_red, cancer_types[train_ind])

predict = model.predict(genes[train_ind])

np.mean(predict == cancer_types[test_ind])


print('The unreduced train set has shape',genes[train_ind, :].shape)
print('The label set being trained to has shape', cancer_types[train_ind].shape)
print('------', '\n', 'After PCA, the reduced train set has shape', genes_train_red.shape ,'\n')

print('The unreduced test set has shape', genes[test_ind].shape)

1 Answer

You fitted your model on the dimension-reduced data with this line:

model.fit(genes_train_red, cancer_types[train_ind])

Now you are asking it to predict on data that is still in the original 2308-dimensional space:

predict = model.predict(genes[train_ind])

model.predict() can only handle samples with the same number of features the model was fitted on, and you kept only 10 principal components. Your new input is still in its original form (not reduced by PCA), so prediction fails until you transform it with the same fitted PCA.
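To make the mismatch concrete, here is a minimal sketch on synthetic data. The random arrays, the 4 made-up class labels, and the names X_train, y_train, pca and knn are assumptions for illustration, not your real data. It only shows that the fitted classifier remembers how many features it saw and rejects input with a different number of columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the real data (assumption): 63 training samples,
# 2308 features, 4 made-up class labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(63, 2308))
y_train = rng.integers(0, 4, size=63)

# Fit PCA on the training data, then fit KNN on the 10-dimensional projection.
pca = PCA(n_components=10).fit(X_train)
knn = KNeighborsClassifier(n_neighbors=2).fit(pca.transform(X_train), y_train)

# Number of features the classifier was fitted on.
print(knn.n_features_in_)

# Passing unreduced data (2308 columns) is rejected with a ValueError.
try:
    knn.predict(X_train)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```

The exact wording of the error depends on the scikit-learn version, but the cause is the same: the query data has 2308 columns while the model expects 10.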

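Correct usage applies the same fitted PCA to the new data before predicting. Note that it should use test_ind rather than train_ind, since you compare the predictions against cancer_types[test_ind]:

predict = model.predict(gene_pca.transform(genes[test_ind]))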
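If you want to avoid forgetting the transform step, one option is to chain PCA and KNN in a scikit-learn pipeline, so that fit() and predict() both receive the unreduced data. The sketch below is a rough illustration under assumptions: the data is random, the 63/20 split is a hypothetical stand-in for whatever test_train_id returns, and the variable names are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins (assumption): 83 samples, 2308 features, 4 made-up labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(83, 2308))
y = rng.integers(0, 4, size=83)

# Hypothetical split: first 63 samples for training, last 20 for testing.
X_train, X_test = X[:63], X[63:]
y_train, y_test = y[:63], y[63:]

# The pipeline fits PCA on the training data only, then fits KNN on the
# 10 retained components. At predict time it applies the same PCA first.
clf = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=2))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)          # raw 2308-dimensional input is fine here
accuracy = np.mean(pred == y_test)  # fraction of correct test predictions
print(pred.shape, accuracy)
```

Because the labels here are random, the accuracy itself is meaningless. The point is only that the pipeline accepts data with the original shape at both fit and predict time.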