1
votes

First of all, I would like to point out that I am a beginner in MATLAB, so I apologize if my question sounds dumb.

I have a dataset with 1460 rows and 36 columns. Three of those columns have some missing values, which appear as NaN. I want to use the k-nearest neighbour approach to estimate those NaNs, but after over 9 hours of trying I'm still not even a step closer to getting a result.

The column with the most missing values is the first column, so let's assume I want to work on that first. The professor has told me to first identify which of the other columns are correlated with the first column. Secondly, I have to split my dataset into a row vector of NaNs only and a matrix of what's left; let's call it matrix A for simplicity. Thirdly, I have to use knnsearch to find the indices from matrix A and then replace the NaNs of the row vector using those indices.

For some reason I am not able to understand the instructions, and I do not think my task is supposed to be rocket science. Is there any simpler way? I just need to fill those missing values in through KNN.

Feedback would be appreciated. Thank you.

2 Answers

0
votes

MATLAB has a ready-made KNN imputation function, knnimpute, that you can use (it is part of the Bioinformatics Toolbox).

Here is an example of how to use it in the Command Window.

>> nanmatrix = [NaN 1 0;1 -1 1;1 0 0]

nanmatrix =

   NaN     1     0
     1    -1     1
     1     0     0

>> cleanmatrix = knnimpute(nanmatrix,1)

cleanmatrix =

     0     1     0
     1    -1     1
     1     0     0

>> cleanmatrix = knnimpute(nanmatrix,2)

cleanmatrix =

    0.3090    1.0000         0
    1.0000   -1.0000    1.0000
    1.0000         0         0

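The first "cleanmatrix" comes from an estimation where k=1, so the NaN takes the value from the single nearest-neighbour column. The second is from an estimation where k=2, so the NaN is replaced by a weighted mean of the two nearest columns, with weights inversely proportional to their distances.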
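
One thing to watch for with a dataset like the one in the question: knnimpute looks for nearest-neighbour columns, not rows. If your rows are observations and your columns are variables, you probably want to transpose before and after the call. Here is a minimal sketch under that assumption; the matrix `data` and its values are made up for illustration.

```matlab
% Assumption: rows are observations, columns are variables
% (like the 1460-by-36 dataset in the question). Toy values only.
data = [NaN  1    0;
        1   -1    1;
        1    0    0;
        2    1    0.5];

k = 1;  % number of neighbours (pick what suits your data)

% knnimpute compares columns, so transpose to turn each observation
% into a column, impute, then transpose back.
filled = knnimpute(data', k)';

disp(filled)
```

Only the NaN entries change; everything else in `filled` is copied from `data`. Requires the Bioinformatics Toolbox.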

Hope this helps!

0
votes

Ignoring the columns that contain missing values, use the remaining columns to measure similarity between records (you can use Euclidean distance for this). Then, using the KNN algorithm, find the records closest to each record that has a missing field, and replace the missing field with the average of that field over those nearest neighbours.
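
The steps above could be sketched roughly like this with knnsearch (Statistics and Machine Learning Toolbox). This is only an illustration, not a definitive implementation: the toy matrix, `k = 2`, and all variable names are assumptions, and it fills a single target column.

```matlab
% Toy data (made up): rows are records, column 1 has a missing value.
data = [NaN 1.0 0.0;
        1.0 0.9 0.1;
        3.0 5.0 4.0;
        1.2 1.1 0.2];

k      = 2;   % number of neighbours (assumed)
target = 1;   % column to impute

completeCols = find(~any(isnan(data), 1));  % columns with no NaN at all
missingRows  = isnan(data(:, target));      % records that need filling
knownRows    = ~missingRows;                % candidate neighbours

% Search for neighbours using only the complete columns.
X   = data(knownRows,   completeCols);
Y   = data(missingRows, completeCols);
idx = knnsearch(X, Y, 'K', k);              % default distance is Euclidean

% idx refers to rows of X, so look up the known target values there.
knownVals     = data(knownRows, target);
neighbourVals = reshape(knownVals(idx), size(idx));  % one row per missing record

filled = data;
filled(missingRows, target) = mean(neighbourVals, 2);

disp(filled)
```

If the columns are on very different scales, it may be worth standardising them (for example with zscore) before the search, since Euclidean distance is otherwise dominated by the largest-valued columns. To handle the other columns with NaNs, you would repeat this with a different `target`.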