2
votes

Pre-processing in data mining sometimes involves regrouping and recoding categorical variables. It is well known that once you recode categorical variables in R (e.g. with the function mapvalues), you need to update each categorical variable with df$variable <- factor(df$variable) so that str(df) shows the real number of levels in your data.frame.

I have written a piece of code to automatically update the categorical variables of a dataset:

cat <- sapply(df, is.factor) #Select categorical variables
names(df[ ,cat]) #View which ones they are
A <- function(x) factor(x) #Create function for "apply"
df[ ,cat] <- data.frame(apply(df[ ,cat],2, A)) #Run apply function
str(df) #Check
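
A minimal sketch of the same refresh without apply, assuming a small made-up data frame `toy` (hypothetical, not the original data). lapply keeps each column as a factor, and base R's droplevels is an equivalent shortcut:

```r
# Hypothetical toy data: `toy` and its columns are made up for illustration.
toy <- data.frame(
  g = factor(c("a", "b", "a", "c")),
  h = factor(c("x", "x", "y", "y")),
  n = 1:4
)

# Subsetting rows keeps the unused level "c" in the factor.
toy <- toy[toy$g != "c", ]
levels_before <- nlevels(toy$g)

# droplevels() on a data frame removes unused levels from every factor column.
toy2 <- droplevels(toy)

# Equivalent refresh column by column; lapply preserves the factor class.
is_fac <- sapply(toy, is.factor)
toy[is_fac] <- lapply(toy[is_fac], factor)
levels_after <- nlevels(toy$g)
```

Comparing `levels_before` with `levels_after` shows whether any unused levels were dropped.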

My question is: how could I select columns whose number of levels is equal to 1, once I have updated my dataset? I have tried these lines without luck:

cat <- sapply(df, is.factor) #Select categorical variables
categorical <- df[,cat] #Create a df named "categorical" separating them
A <- function(x) nlevels(x)==1 #Create "A" function for apply
x <- data.frame(apply(categorical,2, A)) #Run apply function
utils::View(x) #Check and see it is not working...

I appreciate your help and time

2
Maybe indx <- sapply(df[,cat], nlevels)==1; df[,cat][,indx] - akrun
Nice! I thought about length(levels()) - drmariod
@akrun Your line works perfectly. Thank you very much for your response. It returns a logical vector that is TRUE or FALSE depending on whether each categorical variable has a single level. I will mark it as the correct answer. - NuValue
I posted that as a solution. Thanks for the feedback. - akrun

2 Answers

2
votes

You can create a logical index with sapply and use it to select the columns:

  indx <- sapply(df[,cat], nlevels)==1
  df[,cat][,indx, drop=FALSE]

Another option is Filter:

 Filter(function(x) nlevels(x)==1, df[,cat])

Or, in older versions of R where var() still accepted a factor:

 Filter(Negate(var), df[,cat])

Regarding why the apply didn't work,

 apply(df[cat], 2, nlevels)
 # V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 
 # 0   0   0   0   0   0   0   0   0   0 

the output is 0 for all the columns, so something is not correct. Checking the class of each column:

 apply(df[cat], 2, class)
 #       V1          V2          V3          V4          V5          V6 
 #"character" "character" "character" "character" "character" "character" 
 #       V7          V8          V9         V10 
 #"character" "character" "character" "character" 

And the correct class can be found from

 sapply(df[cat], class)
 #    V1       V2       V3       V4       V5       V6       V7       V8 
 #"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
 #    V9      V10 
 #"factor" "factor" 

The class of the columns changed from 'factor' to 'character' because apply first converts the data frame to a matrix, and a matrix can hold only a single class. If any column is non-numeric, all columns of the matrix become 'character'. You can use apply on a numeric matrix, as the return class will also be 'numeric'. In general, when the columns have mixed classes, it is better to use lapply or vapply; sapply is also useful when you want a logical vector.
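
A minimal sketch of the difference, assuming a small made-up data frame `toy` (hypothetical, not the data below). It contrasts apply, which sees character vectors, with vapply, which sees the factor columns as they are:

```r
# Hypothetical toy data; `toy` and its columns are made up for illustration.
toy <- data.frame(
  a = factor(c("x", "x", "x")),
  b = factor(c("p", "q", "p")),
  n = 1:3
)

is_fac <- vapply(toy, is.factor, logical(1))

# apply() first converts the data frame to a character matrix,
# so nlevels() receives plain character vectors and returns 0.
via_apply <- apply(toy[is_fac], 2, nlevels)

# vapply() iterates over the columns unchanged and checks that
# every result is a single logical value.
one_level <- vapply(toy[is_fac], function(x) nlevels(x) == 1, logical(1))

# Keep only the factor columns that have exactly one level.
single <- toy[is_fac][one_level]
```

vapply is used here instead of sapply only because it fails loudly if a column returns something other than one logical value; sapply would work the same way on this data.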

data

set.seed(64)
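# stringsAsFactors=TRUE is needed from R 4.0.0 on, where the default is FALSE
df <- as.data.frame(matrix(sample(LETTERS[1:3], 3*10, replace=TRUE), ncol=10), stringsAsFactors=TRUE)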

df <- cbind(df, V11=1:3)
cat <- sapply(df, is.factor)

1
votes

I have a data frame named train_1. I am trying to find:
1. The categorical variables that have more than 2 and fewer than 20 levels, say.
2. The categorical variables that have more than 2 levels.

Find out categorical variables

cat <- sapply(train_1, is.factor) #Select categorical variables

Levels >2

indx <- sapply(df[,cat], nlevels(df[,cat])>2)
df[,cat][,indx, drop=FALSE]

Error:

   indx <- sapply(df[,cat], nlevels(df[,cat])>2)
   Error in match.fun(FUN) : 
  'nlevels(df[, cat]) > 2' is not a function, character or symbol
  > df[,cat][,indx, drop=FALSE]
  Error in `[.data.frame`(df[, cat], , indx, drop = FALSE) : 
  object 'indx' not found


   >cat
    Store     DayOfWeek          Date         Sales     Customers 
    FALSE         FALSE         FALSE         FALSE         FALSE 
     Open         Promo  StateHoliday SchoolHoliday 
     TRUE          TRUE          TRUE          TRUE 

     filter1<-Filter(function(x) nlevels(x)>2, df[,cat])
     head(filter1)
   StateHoliday
1               0
1116            0
2231            0
3346            0
4461            0
5576            0

There are several categorical variables in cat, but this output is strange: the Open and Promo columns, for example, are missing.
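
For reference, the match.fun error above arises because the second argument of sapply must be a function, whereas nlevels(df[,cat])>2 is an expression that R evaluates before sapply runs. A minimal sketch of both filters on a made-up data frame (`toy` and its values are hypothetical, not the train_1 data):

```r
# Hypothetical toy data loosely mirroring the column names above.
toy <- data.frame(
  Open         = factor(c(0, 1, 1, 0)),
  Promo        = factor(c(0, 1, 0, 1)),
  StateHoliday = factor(c("0", "a", "b", "c")),
  Sales        = c(10, 20, 30, 40)
)
is_fac <- sapply(toy, is.factor)

# Pass the function itself to sapply, then compare the resulting counts.
n_lev <- sapply(toy[is_fac], nlevels)

# Factor columns with more than 2 levels.
more_than_2 <- toy[is_fac][n_lev > 2]

# Factor columns with more than 2 and fewer than 20 levels.
between <- toy[is_fac][n_lev > 2 & n_lev < 20]
```

If Open and Promo are two-level factors in the real data (an assumption), a strict `> 2` test excludes them, which would explain why only StateHoliday is returned; `>= 2` would keep them.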