0 votes

I have some dummy data consisting of 99 rows, with one column of free-text data and one column for the category. Each row has been categorised as either Customer Service or Not Customer Service related.

I passed the 99 rows of data into my R script, created a Corpus, cleaned and parsed the text, and converted it to a DocumentTermMatrix. I then converted the DTM to a data frame to make it easier to view and bound the category column onto it. I then split the rows roughly 50/50, so 50 rows went into my training set and 49 into my testing set. I also pulled out the category.
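Roughly, the preprocessing looks something like this with the tm package (the exact cleaning steps shown here are illustrative, and raw stands for the input data frame with text and category columns):

library(tm)

# build and clean a corpus from the free-text column
corpus <- VCorpus(VectorSource(raw$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# document-term matrix -> plain data frame, with the labels bound back on
dtm <- DocumentTermMatrix(corpus)
mat.df <- as.data.frame(as.matrix(dtm))
mat.df$category <- raw$category

The split and the category extraction then run against this mat.df: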

# 50/50 split: sample row indices for training, keep the rest for testing
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .5))
test <- (1:nrow(mat.df))[- train]
# the known labels, split later by the same indices
cl <- mat.df[, "category"]

I then created a model data frame with the category column stripped out, and passed this new data frame to my KNN call.
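The stripping step is not shown, but it amounts to dropping the label column so only term counts go into the distance calculation; something like this (the name modeldata is the one used below, the exact line is an assumption):

# keep every column except the label
modeldata <- mat.df[, names(mat.df) != "category"]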

library(class)  # knn() lives in the 'class' package

# classify the held-out rows against the training rows (k defaults to 1)
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
conf.mat

From the confusion matrix I can then work out the accuracy, generate a cross table, or export the predictions to check how well the model performs.
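For example, overall accuracy can be read straight off the confusion matrix (assuming conf.mat is predictions by actuals, as built above):

# correct classifications sit on the diagonal
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy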

The bit I am struggling to get my head around at the moment is how I use the model going forward for new data.

So if I then have 10 new rows of free-text data that haven't been manually classified, how do I run the KNN model I have just created to classify this additional data?

Maybe I am just misunderstanding the next step in the process.

Thanks,


1 Answer

0 votes

The same way you just found the hold-out test performance:

knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])
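One caveat, based on the preprocessing described in the question (that code isn't shown, so the object names here are assumptions: dtm is your original DocumentTermMatrix and new_texts is the new raw text): completely_new_data must be cleaned the same way as the training text and must contain exactly the same term columns as modeldata. With tm you can force that by building the new DocumentTermMatrix against the dictionary of the original one, roughly:

# clean the 10 new documents with the same tm_map steps used for training
new.corpus <- VCorpus(VectorSource(new_texts))
# ... same tolower/removePunctuation/stopwords/etc. transformations ...

# keep only the terms the model was trained on
new.dtm <- DocumentTermMatrix(new.corpus, control = list(dictionary = Terms(dtm)))
completely_new_data <- as.data.frame(as.matrix(new.dtm))

After that, the knn() call above classifies the new rows using the original training rows and labels.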

In a KNN model, your training data is intrinsically part of the model. Since prediction is just finding the nearest training points, you can't know which those are without the training points' coordinates; that is why knn() takes the training set on every call instead of returning a reusable fitted object.

That said, why do you want to use a KNN model instead of something more modern (SVM, random forests, boosted trees, neural networks)? KNN models scale extremely poorly with the number of data points.