2
votes

In previous versions of R I could combine factor levels that didn't have a "significant" threshold of volume using the following little function:

whittle = function(data, cutoff_val){
  #convert to a data frame
  tab = as.data.frame.table(table(data))
  #returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}

This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:

> sort(table(data$State))

   05    27    35    40    54    84     9    AP    AU    BE    BI    DI     G    GP    GU    GZ    HN    HR    JA    JM    KE    KU     L    LD    LI    MH    NA 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1 
   OU     P    PL    RM    SR    TB    TP    TW     U    VD    VI    VS    WS     X    ZH    47    BL    BS    DL     M    MB    NB    RP    TU    11    DU    KA 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     2     2     2     2     2     2     2     2     2     3     3     3 
   BW    ND    NS    WY    AK    SD    13    QC    01    BC    MT    AB    HE    ID     J    NO    LN    NM    ON    NE    VT    UT    IA    MS    AO    AR    ME 
    4     4     4     4     5     5     6     6     7     7     7     8     8     8     9    10    11    17    23    26    26    30    31    31    38    40    44 
   OR    KS    HI    NV    WI    OK    KY    IN    WV    AL    CO    WA    MN    NH    MO    SC    LA    TN    AZ    IL    NC    MI    GA    OH    **    CT    DE 
   45    47    48    57    57    64   106   108   112   113   120   125   131   131   135   138   198   200   233   492   511   579   645   646   840   873  1432 
   RI    DC    TX    MA    FL    VA    MD    CA    NJ    PA    NY 
 1782  2513  6992  7027 10527 11016 11836 12221 15485 16359 34045 

Now when I use whittle it returns me the following message:

> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other",  :
  duplicated levels in factors are deprecated

How can I modify my function so that it has the same effect but doesn't use these "deprecated" factor levels? Converting to a character, tabling, and then converting to the character "Other"?

4
It would be nice if you make this example reproducible by including sample input data we could use to test possible solutions.MrFlick

4 Answers

7
votes

I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table and using replace to do the replacement of the low-frequency levels:

whittle <- function(data, cutoff_val) {
  tab = as.data.frame.table(table(data))
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}

Testing on some sample data:

state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD    MD    MD    Other Other
# Levels: MD Other
5
votes

I think this verison should work. The levels<- function allows you to collapse by assigning a list (see ?levels).

whittle <- function(data, cutoff_val){
  tab <- table(data)
  shouldmerge <- tab < cutoff_val
  tokeep <- names(tab)[!shouldmerge]
  tomerge <- names(tab)[shouldmerge]
  nv <- c(as.list(setNames(tokeep,tokeep)), list("Other"=tomerge))
  levels(data)<-nv
  return(data)
}

And we test it with

set.seed(15)
x<-factor(c(sample(letters[1:10], 100, replace=T), sample(letters[11:13], 10, replace=T)))
table(x)
# x
#  a  b  c  d  e  f  g  h  i  j  k  l  m 
#  5 11  8  8  7  5 13 14 14 15  2  3  5 

y <- whittle(x, 9)
table(y)
# y
#     b     g     h     i     j Other 
#    11    13    14    14    15    43 
3
votes

It's worth adding to this answer that the new forcats package contains the fct_lump() function which is dedicated to this.

Using @MrFlick's data:

x <- factor(c(sample(letters[1:10], 100, replace=T), 
              sample(letters[11:13], 10, replace=T)))

library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table

# b     g     h     i     j Other 
#11    13    14    14    15    43 

The n argument specifies the number of most common values to preserve.

2
votes

Here's another way of doing it by replacing all the items below the threshold with the first and then renaming that level to Other.

whittle <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)
}