
I am doing kNN classification on some data. I split the data randomly into training and testing sets in an 80/20 ratio. My data looks like this:

[ [1.0, 1.52101, 13.64, 4.49, 1.1, 71.78, 0.06, 8.75, 0.0, 0.0, 1.0], 
  [2.0, 1.51761, 13.89, 3.6, 1.36, 72.73, 0.48, 7.83, 0.0, 0.0, 2.0],
  [3.0, 1.51618, 13.53, 3.55, 1.54, 72.99, 0.39, 7.78, 0.0, 0.0, 3.0],
  ...
]

Items in the last column of the matrix are the classes: 1.0, 2.0, and 3.0.

After feature normalization, my data looks like this:

[[-0.5036443480260487, -0.03450760227559746, 0.06723230162846759, 0.23028986544844693, -0.025324623254270005, 0.010553065215338569, 0.0015136367098358505, -0.11291235596166802, -0.05819669234942126, -0.12069793876044387, 1.0], 
[-0.4989050339943617, -0.11566537753097901, 0.010637426608816412, 0.2175704556290625, 0.03073267976659575, 0.05764598316498372, -0.012976783512350588, -0.11815839520204152, -0.05819669234942126, -0.12069793876044387, 2.0],
...
]

The formula I used for normalization:

(X - avg(X)) / (max(X) - min(X))
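Applied per feature column, that formula amounts to the following (a minimal NumPy sketch, shown only to illustrate the formula; it assumes the class column has already been removed):

import numpy as np

def mean_range_normalize(features):
    # (X - avg(X)) / (max(X) - min(X)), computed per column
    features = np.asarray(features, dtype=float)
    col_avg = features.mean(axis=0)
    col_range = features.max(axis=0) - features.min(axis=0)
    return (features - col_avg) / col_range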

I perform kNN classification 100 times for each K = 1 to 25 (odd values only) and record the average accuracy for each K. Here are my results:

Average accuracy for K=1 after 100 tests with different data split: 98.91313003886198 %   
Average accuracy for K=3 after 100 tests with different data split: 98.11976006170633 %    
Average accuracy for K=5 after 100 tests with different data split: 97.71226079929019 %  
Average accuracy for K=7 after 100 tests with different data split: 97.47493145754373 %    
Average accuracy for K=9 after 100 tests with different data split: 97.16596220947888 %   
Average accuracy for K=11 after 100 tests with different data split: 96.81465365733266 %   
Average accuracy for K=13 after 100 tests with different data split: 95.78772655522567 %    
Average accuracy for K=15 after 100 tests with different data split: 95.23116406332706 %    
Average accuracy for K=17 after 100 tests with different data split: 94.52371789094929 %    
Average accuracy for K=19 after 100 tests with different data split: 93.85285871435981 %   
Average accuracy for K=21 after 100 tests with different data split: 93.26620809747965 %    
Average accuracy for K=23 after 100 tests with different data split: 92.58047022661833 %
Average accuracy for K=25 after 100 tests with different data split: 90.55746523509124 %

But when I apply feature normalization, the accuracy drops significantly. My kNN results with normalized features:

Average accuracy for K=1 after 100 tests with different data split: 88.56128075154439 % 
Average accuracy for K=3 after 100 tests with different data split: 85.01466511662318 %    
Average accuracy for K=5 after 100 tests with different data split: 83.32096281613967 %    
Average accuracy for K=7 after 100 tests with different data split: 83.09434478900455 %   
Average accuracy for K=9 after 100 tests with different data split: 82.05628926919964 %  
Average accuracy for K=11 after 100 tests with different data split: 79.89732262550343 %   
Average accuracy for K=13 after 100 tests with different data split: 79.60617886853211 %    
Average accuracy for K=15 after 100 tests with different data split: 79.26511126374507 %    
Average accuracy for K=17 after 100 tests with different data split: 77.51457877706329 %   
Average accuracy for K=19 after 100 tests with different data split: 76.97848441605367 %    
Average accuracy for K=21 after 100 tests with different data split: 75.70005919265326 %    
Average accuracy for K=23 after 100 tests with different data split: 76.45758217099551 %   
Average accuracy for K=25 after 100 tests with different data split: 76.16619492431572 %

My algorithm has no logic errors in its code; I verified it on simple data.
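For reference, the whole experiment is roughly equivalent to the sketch below (scikit-learn is used here only to illustrate the procedure; it is not my actual implementation):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def run_experiment(data, normalize, n_runs=100, ks=range(1, 26, 2)):
    data = np.asarray(data, dtype=float)
    X, y = data[:, :-1], data[:, -1]              # features, class labels
    if normalize:
        X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    accuracies = {k: [] for k in ks}
    for _ in range(n_runs):                       # 100 random 80/20 splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        for k in ks:                              # odd K from 1 to 25
            knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
            accuracies[k].append(knn.score(X_te, y_te))
    return {k: 100 * np.mean(a) for k, a in accuracies.items()}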


Why does the accuracy of kNN classification decrease so much after feature normalization? I assumed normalization by itself is not supposed to hurt the accuracy of any classifier. What is the purpose of feature normalization then?


2 Answers

3 votes

It is a common misconception that normalization never reduces classification accuracy. It very well can.

HOW?

The relative values within a row matter a great deal: they determine where a point sits in feature space. Normalization can severely shift that relative placement. The effect is felt especially in k-NN classification, because k-NN operates directly on distances between points. By comparison, the effect is weaker in an SVM, where the optimization process can often still find a reasonably accurate hyperplane.

Also note that you normalize using avg(X). Consider two values in adjacent columns of a particular row: if the first falls well below its column's average and the second well above its column's average, then even though the raw values are numerically close, the normalized values end up far apart, and the resulting distance calculations can differ hugely.
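A small numeric illustration of that effect (hypothetical numbers):

import numpy as np

# In row 0 the raw values 7.8 and 8.1 are numerically close, but 7.8 sits
# well below its column's average while 8.1 sits well above its column's
# average.
X = np.array([
    [ 7.8,  8.1],
    [13.5,  3.6],
    [13.9,  3.5],
])

col_avg = X.mean(axis=0)
col_range = X.max(axis=0) - X.min(axis=0)
Xn = (X - col_avg) / col_range                # (X - avg(X)) / (max(X) - min(X))

print("raw row 0:       ", X[0])              # [ 7.8  8.1] -> close together
print("normalized row 0:", Xn[0])             # approx [-0.64  0.66] -> far apart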

Never expect normalization to do wonders.

2 votes

kNN works by finding the instances most similar to a query point, using the Euclidean distance between points. By normalizing you change the scale of the features, which changes those distances and therefore your accuracy.
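A tiny illustration with hypothetical numbers: the point that is nearest under raw Euclidean distance need not stay nearest once the feature scales change.

import numpy as np

# Column 1 has a much larger range than column 0, so it dominates raw
# Euclidean distances.
X = np.array([
    [1.700, 70.3],
    [1.515, 70.8],
    [1.600, 73.0],
])
query = np.array([1.510, 70.2])

# Mean-range normalization from the question, applied to data and query together.
Z = np.vstack([X, query])
Z = (Z - Z.mean(axis=0)) / (Z.max(axis=0) - Z.min(axis=0))
Xn, qn = Z[:-1], Z[-1]

raw_dist = np.linalg.norm(X - query, axis=1)
norm_dist = np.linalg.norm(Xn - qn, axis=1)
print("nearest neighbour, raw features:        row", raw_dist.argmin())   # row 0
print("nearest neighbour, normalized features: row", norm_dist.argmin())  # row 1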

Look at this research; if you go to the figures, you will find that different scaling techniques give different accuracies.