
I'm trying to use knn in R (I've tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k rows of 8 columns, but my machine seems to have difficulty with even a 10k-row sample. Any suggestions for doing knn on a dataset larger than ~50 rows (i.e., bigger than iris)?

EDIT:

To clarify, there are a couple of issues.

1) The examples in the class and knnflex packages are a bit unclear. I was curious whether there is an implementation similar to the randomForest package, where you give it the variable you want to predict and the data you want to use to train the model:

RF <- randomForest(x, y, ntree, ...) 

and then turn around and use the fitted model to predict on the test data set:

pred <- predict(RF, testData)

2) I'm not really understanding why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix of roughly nrow(trainingData)^2 entries, which also seems to put an upper limit on the size of the data that can be predicted. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5000 rows. Thus I would need to either:

a) find a way to use > 5000 lines in a training set

or

b) find a way to use the model on the full 100k lines.

Just wondering, how far did you eventually manage to push this, in terms of training set size? – ktdrv
@ktdrv: I believe I managed to do the full data set. I would recommend the knn implementation in the 'caret' package for two reasons. First, it allows for tuning the 'k' parameter. Second, it's the fastest knn model I've used, and it allows for parallelization (though I didn't see a huge speed-up for knn). Here's a good set of explanations and examples to get up and running: jstatsoft.org/v28/i05/paper – screechOwl
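
For reference, a minimal sketch of the caret workflow mentioned in the comment above; the data frame and column names (trainData, testData, default) are placeholders, not from the original post:

# Hedged sketch of tuning k with caret; trainData/testData/default are
# hypothetical names, and 'default' is assumed to be a factor (classification).
library(caret)

ctrl <- trainControl(method = "cv", number = 5)            # 5-fold cross-validation
fit  <- train(default ~ ., data = trainData,
              method     = "knn",
              preProcess = c("center", "scale"),           # knn is scale-sensitive
              tuneGrid   = data.frame(k = c(3, 5, 7, 9)),  # candidate k values
              trControl  = ctrl)

pred <- predict(fit, newdata = testData)                   # predicted classes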

1 Answer


The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.

The training data is the model.

To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
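
For example, a minimal call with class::knn (the object and column names below are placeholders, not from the question) passes the training and test sets together and returns the predicted labels for the test rows:

library(class)
# trainData/testData/predictors/default are hypothetical names
pred <- knn(train = trainData[, predictors],
            test  = testData[, predictors],
            cl    = trainData$default,   # class labels for the training rows
            k     = 5)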

The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
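
If I remember the ipred interface correctly (treat the exact argument names as an assumption), that structure looks roughly like this:

library(ipred)
fit  <- ipredknn(default ~ ., data = trainData, k = 5)     # "training" mostly just stores the data
pred <- predict(fit, newdata = testData, type = "class")   # the distance work happens here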

As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.

The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8 GB of RAM. Also, I suspect that knn in class will be faster than in knnflex, but I haven't done extensive testing.
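
One workaround for the size limit, sketched here with made-up names rather than anything from the original post, is to keep the full training set but feed the test rows to knn in chunks, so no single call has to hold a huge distance structure:

library(class)
# trainData/testData/predictors/default are hypothetical names
chunk_size <- 5000
chunks <- split(seq_len(nrow(testData)),
                ceiling(seq_len(nrow(testData)) / chunk_size))

# Predict each chunk of test rows against the full training set, then recombine
pred <- unlist(lapply(chunks, function(i)
  as.character(knn(train = trainData[, predictors],
                   test  = testData[i, predictors],
                   cl    = trainData$default,
                   k     = 5))))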