
I'm trying to use knn on my dataset, which has 65499 rows and 6 columns.

My dataset:

    > dput(head(sampleknn))
structure(list(RequestorSeniority = c(1L, 2L, 2L, 4L, 1L, 4L), 
    ITOwner = c(50L, 15L, 15L, 22L, 22L, 38L), Severity = c(2L, 
    1L, 2L, 2L, 2L, 2L), Priority = c(0L, 1L, 0L, 0L, 1L, 3L), 
    daysOpen = c(3L, 5L, 0L, 20L, 1L, 0L), Satisfaction = structure(c(4L, 
    4L, 3L, 3L, 4L, 3L), .Label = c("Amazing", "Satisfied", "Unknown", 
    "Unsatisfied"), class = "factor")), .Names = c("RequestorSeniority", 
"ITOwner", "Severity", "Priority", "daysOpen", "Satisfaction"
), row.names = c(NA, 6L), class = "data.frame")

>str(sampleknn)
    'data.frame':   65499 obs. of  6 variables:
     $ RequestorSeniority: int  1 2 2 4 1 4 3 4 2 3 ...
     $ ITOwner           : int  50 15 15 22 22 38 10 1 14 46 ...
     $ Severity          : int  2 1 2 2 2 2 2 2 2 2 ...
     $ Priority          : int  0 1 0 0 1 3 3 0 2 1 ...
     $ daysOpen          : int  3 5 0 20 1 0 9 15 6 1 ...
     $ Satisfaction      : Factor w/ 4 levels "Amazing","Satisfied",..: 4 4 3 3 4 3 3 3 4 4 ...

Now I'm trying to use knn on this dataset (code below) and it gives me the following error:

Error in knn(train = sampleknn_train, test = sampleknn_test, cl = sampleknn_test_target, : 'train' and 'class' have different lengths

Code:

sampleknn <- read.csv(file="HelpDesk.csv",head=TRUE,sep=",")
str(sampleknn)
#---scaling
normalize <- function(x) {
  return((x-min(x))/(max(x)-min(x)))
}

sampleknn_n <- as.data.frame(lapply(sampleknn[ ,c(1,2,3,4,5)], normalize))
str(sampleknn_n)

#train the data from sampleknn_n
sampleknn_train <- sampleknn_n[1:65000, ]
#create a test dataset
sampleknn_test <- sampleknn_n[65001:65499, ]
#isolate test and train satisfaction levels
sampleknn_train_target <- sampleknn[1:65000, 6]
sampleknn_test_target <- sampleknn[65001:65499, 6]

#-----knn model
sqrt(65499)
m1 <- knn(train=sampleknn_train, test=sampleknn_test, cl=sampleknn_test_target,k=255)

When I run the last line (m1 <- ...), I get the error 'train' and 'class' have different lengths. I have looked at other answers about the same error, but none of the suggestions worked for me. How do I fix this? Let me know if you need more information.

Edit:

Before Normalization:

RequestorSeniority ITOwner Severity Priority daysOpen Satisfaction
                  1      50        2        0        3  Unsatisfied
                  2      15        1        1        5  Unsatisfied
                  2      15        2        0        0      Unknown
                  4      22        2        0       20      Unknown
                  1      22        2        1        1  Unsatisfied
                  4      38        2        3        0      Unknown

After Normalization:

RequestorSeniority      ITOwner Severity     Priority      daysOpen
       0.0000000000 1.0000000000     0.50 0.0000000000 0.05555555556
       0.3333333333 0.2857142857     0.25 0.3333333333 0.09259259259
       0.3333333333 0.2857142857     0.50 0.0000000000 0.00000000000
       1.0000000000 0.4285714286     0.50 0.0000000000 0.37037037037
       0.0000000000 0.4285714286     0.50 0.3333333333 0.01851851852
       1.0000000000 0.7551020408     0.50 1.0000000000 0.00000000000

> dput(head(sampleknn_n))
structure(list(RequestorSeniority = c(0, 0.333333333333333, 0.333333333333333, 
1, 0, 1), ITOwner = c(1, 0.285714285714286, 0.285714285714286, 
0.428571428571429, 0.428571428571429, 0.755102040816326), Severity = c(0.5, 
0.25, 0.5, 0.5, 0.5, 0.5), Priority = c(0, 0.333333333333333, 
0, 0, 0.333333333333333, 1), daysOpen = c(0.0555555555555556, 
0.0925925925925926, 0, 0.37037037037037, 0.0185185185185185, 
0)), .Names = c("RequestorSeniority", "ITOwner", "Severity", 
"Priority", "daysOpen"), row.names = c(NA, 6L), class = "data.frame")
Can you give us a reproducible example? stackoverflow.com/help/mcve - Hack-R
@Hack-R Here is the example I'm trying to replicate (although the video uses the iris dataset): youtube.com/watch?v=DkLNb0CXw84 - user1711524
Thanks, but you need to provide your reproducible example in the question so that we can copy and paste it to reproduce your result. BTW, did you look at stackoverflow.com/questions/16276388/… ? - Hack-R
@Hack-R Yes, I did, but that doesn't solve it. BTW, I edited the question to include the dataset I'm using. - user1711524
Thanks, but head() isn't the same as providing a reproducible dataset. You should use dput() or a builtin data set. Hover your mouse over the R tag for more info on this. - Hack-R

1 Answer


From ?knn:

cl        factor of true classifications of training set

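So cl must hold the class labels of the training rows, one label per row of train. Your call passes sampleknn_test_target (499 labels) together with a train set of 65000 rows, which is why knn reports that 'train' and 'class' have different lengths. Pass the training labels instead: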

m1 <- knn(train=sampleknn_train, test=sampleknn_test, cl=sampleknn_train_target,k=255)
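Since HelpDesk.csv isn't available here, below is a minimal sketch of the same workflow on the built-in iris data, which stands in for your data. The 120/30 split, the seed, and k = 11 are arbitrary choices for illustration, not recommendations. It shows the training labels going to cl and the test labels being used only to check the predictions afterwards.

```r
library(class)  # provides knn()

# Min-max scaling, same idea as the normalize() in the question
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(42)
idx <- sample(nrow(iris))  # shuffle rows, because iris is sorted by Species
iris_n <- as.data.frame(lapply(iris[idx, 1:4], normalize))
labels <- iris$Species[idx]

# Split the features and the labels with the same row indices
train <- iris_n[1:120, ]
test <- iris_n[121:150, ]
train_target <- labels[1:120]
test_target <- labels[121:150]

# cl takes the TRAINING labels, so length(cl) must equal nrow(train)
stopifnot(length(train_target) == nrow(train))
pred <- knn(train = train, test = test, cl = train_target, k = 11)

# The test labels are only used to evaluate the predictions
print(table(predicted = pred, actual = test_target))
```

The shuffle matters for iris because its rows are ordered by class. If your file is also ordered in some way (for example by date), taking the first 65000 rows as the training set may be worth revisiting, but that is a separate issue from the error above.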