3
votes

I used qda {MASS} to fit a classifier for my data, and it always reports "some group is too small for 'qda'". Is this due to the size of the test data I used for the model? I increased the test sample size from 30 to 100, and it reported the same error. Help!

set.seed(1345)
AllMono <- AllData[AllData$type == "monocot",]
MonoSample <- sample (1:nrow(AllMono), size = 100, replace = F)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot",]
EudiSample <- sample (1:nrow(AllEudi), size = 100, replace = F)
testData <- rbind (AllMono[MonoSample,],AllEudi[EudiSample,])
plot (testData$mono_score, testData$eudi_score, col = as.numeric(testData$type), xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda (type~mono_score+eudi_score, data = testData)

Here is my data example

>head (testData)
                                sequence mono_score eudi_score    type
PhHe_4822_404_76         DTRPTAPGHSPGAGH    51.4930   39.55000 monocot
SoBi_10_265860_58        QTESTTPGHSPSIGH    33.1408    2.23333 monocot
EuGr_5_187924_158        AFRPTSPGHSPGAGH    27.0000   54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH    20.6901   50.21670 eudicot
PoTr_Chr07_112594_90     DFRPTAPGHSPGVGH    43.8732   56.66670 eudicot
OrSa.JA_3_261556_75      GVRPTNPGHSPGIGH    55.0986   45.08330 monocot
PaVi_contig16368_21_57   QTDSTTPGHSPSIGH    25.8169    2.50000 monocot

>testData$type <- as.factor (testData$type)

> dim (testData)
[1] 200   4

> levels (testData$type)
[1] "eudicot" "monocot" "other" 

> table (testData$type)
eudicot monocot   other 
    100     100       0

> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils

My R version is R 3.0.2.

The test in MASS:::qda.default is if (any(counts < p + 1)) stop("some group is too small for 'qda'"), where counts is the number of occurrences in each category and p is the number of columns in the predictor matrix ... - Ben Bolker
So the number of occurrences in each category is, in my situation, the number of occurrences in each type (the response variable in the model), right? And p is the number of columns in testData? I have only two categories in "type" ("mono" and "eudi"), and both counts should be 100, since I sampled each with a size of 100. p is 3 (there are three columns in testData: mono_score, eudi_score and type). I think my setup satisfies "any(counts > p + 1)". - user2993059
It's going to be hard to go much further in helping you without a reproducible example (tinyurl.com/reproducible000). - Ben Bolker
I have placed a data example above now. Thank you :) Is that OK? - user2993059
No, but I have a guess (see below). - Ben Bolker

2 Answers

3
votes

tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.

Here's a way to make up a data set that looks like yours:

set.seed(101)
mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
                 mono_score=runif(100,0,100),
                 dicot_score=runif(100,0,100))

Some useful diagnostics:

str(mytest)
## 'data.frame':    200 obs. of  3 variables:
##  $ type       : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 2 ...
##  $ mono_score : num  37.22 4.38 70.97 65.77 24.99 ...
##  $ dicot_score: num  12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
##       type       mono_score      dicot_score     
##  dicot  :100   Min.   : 1.019   Min.   : 0.8594  
##  monocot:100   1st Qu.:24.741   1st Qu.:26.7358  
##                Median :57.578   Median :50.6275  
##                Mean   :52.502   Mean   :52.2376  
##                3rd Qu.:77.783   3rd Qu.:78.2199  
##                Max.   :99.341   Max.   :99.9288  
## 
with(mytest,table(type))
## type
##   dicot monocot 
##    100     100 

Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third test is actually the important one in this case, since the problem was a spurious extra level (note the empty "other" level in your table() output); the droplevels() function should take care of this problem ...

This made-up example seems to work fine, so there must be something you're not showing us about your data set ...

library(MASS)
qda(type~mono_score+dicot_score,data=mytest)
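
For the spurious-level case mentioned in the update above, here is a minimal sketch of how an empty factor level can trigger the error and how droplevels() clears it. The data frame withExtra and its values are made up for illustration; they are an assumption about the shape of your data, not your actual data.

```r
library(MASS)

set.seed(101)
# Hypothetical data: two populated classes plus an unused "other" level,
# as could be left over after subsetting a larger data set
withExtra <- data.frame(
  type = factor(rep(c("monocot", "eudicot"), each = 100),
                levels = c("eudicot", "monocot", "other")),
  mono_score = runif(200, 0, 100),
  eudi_score = runif(200, 0, 100)
)

# "other" is still a level, with a count of 0
table(withExtra$type)

# The empty level is treated as a group of size 0, so the size check fails
res <- try(qda(type ~ mono_score + eudi_score, data = withExtra),
           silent = TRUE)
inherits(res, "try-error")

# Dropping unused levels leaves only the two populated groups
withExtra$type <- droplevels(withExtra$type)
fit <- qda(type ~ mono_score + eudi_score, data = withExtra)
fit$counts
```

If the call still fails after droplevels(), the predictor types are the next thing to check.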

Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them which would then make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...

bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) : 
##    some group is too small for 'qda'
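
If that guess applies, a sketch of the check and repair might look like the following. The data frame fac_df and its values are hypothetical, and this assumes the factor labels really are numbers stored as text.

```r
# Hypothetical data frame where a score column ended up as a factor
fac_df <- data.frame(
  type = factor(rep(c("monocot", "dicot"), each = 3)),
  mono_score = factor(c("51.5", "33.1", "27.0", "20.7", "43.9", "55.1"))
)

# Check the class of every column; score columns should be "numeric"
sapply(fac_df, class)

# Convert via character: as.numeric() applied directly to a factor
# returns the internal integer codes, not the printed values
fac_df$mono_score <- as.numeric(as.character(fac_df$mono_score))
str(fac_df)
```

Any entry that cannot be parsed as a number becomes NA (with a warning), which is a handy way to locate the glitch in the original file.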
0
votes

I had this error as well, so I'll explain what went wrong on my side for anyone stumbling upon this in the future.

The variable you want to predict may be a factor. Every level of this factor must have enough observations. If any level has too few (including zero), you will get this error.

In my case, I had removed all rows of one level, but the level itself was still present in the factor.

To remove the unused level, re-create the factor:

df$var %<>% factor

NB. %<>% requires magrittr

However, even when I did this, it still failed. When I debugged further, it appeared that if you subset a data frame whose column has already been converted to a factor, the subset keeps all the original levels, so you have to re-create the factor on the subset.
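
A small base R sketch of that behaviour, with no magrittr needed. The data frame df and its values here are made up for illustration.

```r
# Hypothetical data frame with three classes
df <- data.frame(
  var = factor(c("monocot", "eudicot", "other", "monocot", "eudicot")),
  score = c(51.5, 54.6, 10.2, 33.1, 50.2)
)

# Subsetting removes the rows but keeps "other" as an empty level
sub <- df[df$var != "other", ]
levels(sub$var)
table(sub$var)

# Base R equivalent of `df$var %<>% factor`, applied to the subset
sub$var <- factor(sub$var)
levels(sub$var)

# Alternatively, drop unused levels from every factor column at once
sub2 <- droplevels(df[df$var != "other", ])
levels(sub2$var)
```

The order matters: re-creating the factor before subsetting has no lasting effect on the subset, because the subset inherits whatever levels the parent column had at that point.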