Calculating linear discriminant classification function scores for each row in new test data

Question

I am having trouble classifying new cases in linear Discriminant Function Analysis (DFA) - specifically calculating the discriminant function values for each new test case from the raw variables, so I can then plot and overlay with the points from the training data.

For my training dataframe ref, I performed a linear DFA on a dataset with 895 rows and 14 measurement variables using z <- lda(ref$species ~ ref$v1 + ref$v2 + ref$v3 ...etc... + ref$v14), which gives seven LD functions. I then use predict to obtain the discriminant function scores for each of the 895 individuals on the discriminant function axes with z1 <- predict(z)$x.

Now I want to classify thousands of new cases using the first two discriminant functions (let's just say the new data has three rows in a short example file). I read a new file, test, into R; it has vectors (raw measurements) with the same names as those in ref. Then I call z2 <- predict(z, test) but get a warning message: "Warning message: 'newdata' had 3 rows but variables found have 895 rows". How can I simply append new vectors with these scores to my dataframe test? Or at least produce a matrix of these scores?

I have not managed to sort myself out despite reading and trying things from these two good sites: here and here. For the second link scroll to "Loadings for the Discriminant Functions".

Surely it is something simple I am missing...perhaps the ref and test dataframes not matching somehow...

MrFlick · Accepted Answer · 2014-05-20T04:01:11

Using the sample data from the second website you listed, I was able to run

library(MASS)  # provides lda()

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
wine.lda <- lda(V1 ~ V2 + V3 + V4 + V5 + V6 + V7 +
    V8 + V9 + V10 + V11 + V12 + V13 + V14, wine)

#create "new" data                                
ss<-aggregate(.~V1, wine, mean)[-1]

#predict on new data
predict(wine.lda, ss)

So I think the problem is how you specified your model (or really, the names of the covariates in the model). I think that predict will check to make sure that

attr(wine.lda$terms,"term.labels") == names(ss)

It is likely that all the terms in your lda model include the "ref$" prefix, so they won't match the column names in your new data. predict then picks up the original 895-row vectors from ref instead of the columns in test, which is what the warning reports. I don't know why that guide uses such an awful example of formula notation. I would recommend doing as I did above: take the data.frame name off each of the terms and supply the data.frame as the second parameter. This should make it possible to match up the names with new data.
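As a rough sketch of how that recommendation could also cover the "append the scores to test" part of the question: the example below uses the built-in iris data as a stand-in for ref (an assumption, since the original data is not available), renames its columns to v1..v4 and species purely for illustration, and takes three rows as a hypothetical test data frame. The scores come from the x component of the predict result.

```r
library(MASS)  # provides lda()

# Stand-in for the asker's 'ref' (assumption): iris with illustrative column names
ref <- iris
names(ref) <- c("v1", "v2", "v3", "v4", "species")

# Bare variable names in the formula; the data frame is passed separately
z <- lda(species ~ v1 + v2 + v3 + v4, data = ref)

# Hypothetical "new" cases: same predictor column names, no ref$ prefix needed
test <- ref[c(1, 51, 101), c("v1", "v2", "v3", "v4")]

# predict() returns a list with class, posterior, and x (discriminant scores)
z2 <- predict(z, newdata = test)

# Append the first two discriminant scores and the predicted class to test
test$LD1 <- z2$x[, "LD1"]
test$LD2 <- z2$x[, "LD2"]
test$predicted <- z2$class
```

With the scores stored as columns of test, they can be overlaid on a plot of predict(z)$x from the training data. If only the matrix is wanted, z2$x already holds one row of scores per new case.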
