I am trying to predict the Species (3 classes) from the iris dataset:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

I've created the numerical vectors tr and nw, which I use to subset iris so I can get training data and new data:

> knn5 <- knn(iris[tr, -5], iris[nw, -5], iris$Species[nw], k = 5, prob = TRUE)
> knn5
[1] versicolor virginica  virginica  versicolor virginica  setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     versicolor virginica 
[16] setosa     setosa     setosa     virginica  setosa     setosa     virginica  versicolor virginica  virginica  versicolor setosa     versicolor versicolor setosa    
[31] versicolor setosa     virginica  setosa     versicolor versicolor versicolor setosa     versicolor versicolor virginica  virginica  virginica  setosa     versicolor
[46] setosa     versicolor versicolor setosa     versicolor
attr(,"prob")
 [1] 0.4000000 0.4000000 0.4000000 0.6000000 0.4000000 0.6000000 0.6000000 0.4000000 0.3333333 0.6000000 0.6000000 0.5000000 0.6000000 0.6000000 0.6000000 0.5000000
[17] 0.4000000 0.6000000 0.4000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.8000000 0.4000000 0.6000000 0.6000000 0.6000000 0.4000000 0.6000000
[33] 0.4000000 0.6000000 0.8000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.6000000 0.5000000 0.6000000 0.3333333 0.4000000 0.6000000
[49] 0.6000000 0.6000000
Levels: setosa versicolor virginica

I understand that the predictions are very bad because in the knn I put the wrong vector for the labels; my question is not about that.

My question is, why am I getting 0.3333333 as values for prob? Since we are looking at 5 neighbors, I would expect that we only get values of the form n/5.

My initial guess was that these are places where there was a tie. However, I then realized that the 0.4000000 values must already be ties: with only 3 classes, a winning share of 2/5 means the other two classes got 2/5 and 1/5. So I'm no longer sure about my guess.

1 Answer

I assume that you are using knn from the class package. Notice that it has an argument use.all described in the documentation like this:

use.all

controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.

The iris data contains a pair of exact duplicate points:

> iris[c(102,143),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
143          5.8         2.7          5.1         1.9 virginica

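So if one of these points is the 5th nearest neighbor, they both are, and with the default use.all = TRUE, 6 points will vote instead of 5. The proportions are then of the form n/6, which is where 0.3333333 (2/6) and 0.5000000 (3/6) come from.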
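
To see the effect in isolation, here is a minimal sketch on made-up one-dimensional data (the train, labels and test objects are hypothetical and are not taken from iris). Two training points are placed at the same distance from the test point, so they tie for 2nd nearest when k = 2:

```r
library(class)  # provides knn()

# Toy 1-D data (hypothetical): the test point sits at 0, and the
# training points at 2 and -2 tie for 2nd-nearest neighbour.
train  <- matrix(c(1, 2, -2), ncol = 1)
labels <- factor(c("b", "a", "a"))
test   <- matrix(0, ncol = 1)

# Default use.all = TRUE: both tied points vote, so 3 neighbours are used.
p_all <- attr(knn(train, test, labels, k = 2, prob = TRUE, use.all = TRUE), "prob")

# use.all = FALSE: only one of the tied points is kept, so exactly 2 vote.
p_k <- attr(knn(train, test, labels, k = 2, prob = TRUE, use.all = FALSE), "prob")

print(c(use_all = p_all, exactly_k = p_k))
```

If ties are handled as the documentation describes, the first value should be 2/3 (two of three votes for "a") and the second 1/2 (one vote each), so passing use.all = FALSE is one way to get back to proportions of the form n/k.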