Select categorical variables where number of levels is equal to 1

Question

Pre-processing in data mining sometimes involves re-grouping and re-coding categorical variables. It is well known that once you recode a categorical variable in R (e.g. with the function mapvalues), you need to refresh it with df$variable <- factor(df$variable) so that str(df) shows the real number of levels in your data.frame.

I have written a piece of code to automatically update all the categorical variables of a dataset:

cat <- sapply(df, is.factor) #Select categorical variables
names(df[ ,cat]) #View which ones they are
A <- function(x) factor(x) #Create function for "apply"
df[ ,cat] <- data.frame(apply(df[ ,cat],2, A)) #Run apply function
str(df) #Check

My question is: how could I select columns whose number of levels is equal to 1, once I have updated my dataset? I have tried these lines without luck:

cat <- sapply(df, is.factor) #Select categorical variables
categorical <- df[,cat] #Create a df named "categorical" separating them
A <- function(x) nlevels(x)==1 #Create "A" function for apply
x <- data.frame(apply(categorical,2, A)) #Run apply function
utils::View(x) #Check and see it is not working...

I appreciate your help and time

Maybe indx <- sapply(df[,cat], nlevels)==1; df[,cat][,indx] — akrun
@akrun Your line works perfectly. Thank you very much for your response. It returns a logical vector with TRUE/FALSE depending on whether each categorical variable has a single level or not. I will tag it as the correct answer. — NuValue

akrun · Accepted Answer · 2015-07-17T11:17:14

You can create a logical index with sapply and use it to select the columns that have a single level. Adding drop=FALSE keeps the result a data.frame even when only one column matches.

  indx <- sapply(df[,cat], nlevels)==1
  df[,cat][,indx, drop=FALSE]

Another option is Filter:

 Filter(function(x) nlevels(x)==1, df[,cat])

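Or, in older versions of R where var() still accepted a factor (current versions raise an error for a factor argument):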

 Filter(Negate(var), df[,cat])
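
As a minimal, self-contained sketch of the same idea: the toy data frame `toy` and its column names below are made up for illustration and are not the question's data. It uses vapply() rather than sapply() only so that the result type is checked; either should work here.

```r
# Hypothetical toy data: 'b' and 'd' are factors with a single level
toy <- data.frame(
  a = factor(c("x", "y", "x")),
  b = factor(c("k", "k", "k")),
  n = 1:3,
  d = factor(c("z", "z", "z"))
)

# Logical index of the factor columns
is_cat <- vapply(toy, is.factor, logical(1))

# Among all columns, flag the factors that have exactly one level
# (nlevels() returns 0 for non-factor columns such as 'n')
one_level <- is_cat & vapply(toy, nlevels, integer(1)) == 1L

# drop = FALSE keeps a data.frame even if only one column is selected
single <- toy[, one_level, drop = FALSE]
names(single)
```

Building the index over the whole data frame means the same logical vector can also be negated to drop the single-level columns, e.g. toy[, !one_level, drop = FALSE].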

As for why apply didn't work:

 apply(df[cat], 2, nlevels)
 # V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 
 # 0   0   0   0   0   0   0   0   0   0

The output is 0 for all the columns, so something is not right. Checking the class that apply sees for each column:

 apply(df[cat], 2, class)
 #       V1          V2          V3          V4          V5          V6 
 #"character" "character" "character" "character" "character" "character" 
 #       V7          V8          V9         V10 
 #"character" "character" "character" "character"

whereas sapply reports the correct class:

 sapply(df[cat], class)
 #    V1       V2       V3       V4       V5       V6       V7       V8 
 #"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
 #    V9      V10 
 #"factor" "factor"

The class of the columns changed from 'factor' to 'character' because apply first converts the data.frame to a matrix, and a matrix can hold only a single class. If there is any non-numeric column, the whole matrix is converted to 'character', and nlevels of a character vector is 0. You can safely use apply on a numeric matrix, as the columns stay 'numeric'. In general, when the columns have mixed classes, it is better to use lapply or vapply, which work column by column; sapply is also useful when you want a simple result such as a logical vector.
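
The coercion can be reproduced on a small example. The sketch below uses a made-up data frame `toy` (not the question's data) and also shows lapply() as one possible column-wise way to refresh the factors, in place of the apply() call from the question.

```r
# Hypothetical toy data with an unused level "b" in column 'g'
toy <- data.frame(
  g = factor(c("a", "a", "a"), levels = c("a", "b")),
  h = factor(c("u", "v", "u")),
  n = c(1.5, 2.5, 3.5)
)
is_cat <- vapply(toy, is.factor, logical(1))

# apply() first converts the data frame to a character matrix,
# so each column reaches the function as a plain character vector
apply(toy[is_cat], 2, class)     # "character" for both columns
apply(toy[is_cat], 2, nlevels)   # 0 for both columns

# lapply() works column by column and keeps each factor intact;
# calling factor() again drops the unused level "b" from 'g'
toy[is_cat] <- lapply(toy[is_cat], factor)

# vapply() checks that every result is a single integer
lev <- vapply(toy[is_cat], nlevels, integer(1))
lev
```

After the refresh, 'g' has one level and 'h' has two, so lev == 1 picks out 'g' only.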

data

set.seed(64)
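# stringsAsFactors=TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- as.data.frame(matrix(sample(LETTERS[1:3], 3*10, replace=TRUE), ncol=10), stringsAsFactors=TRUE)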

df <- cbind(df, V11=1:3)
cat <- sapply(df, is.factor)
