Statistical mode of a categorical variable in R (using mlv)

1

votes

I want to calculate the most frequent value of a categorical variable. I tried using the mlv function in the modeest package, but I am getting NAs.

user <- c("A","B","A","A","B","A","B","B")
color <- c("blue","green","blue","blue","green","yellow","pink","blue")
df <- data.frame(user,color)
df$color <- as.factor(df$color)

library(plyr)
library(dplyr)
library(modeest)

summary <- ddply(df,.(user),summarise,mode=mlv(color,method="mlv")[['M']])

Warning messages:
1: In discrete(x, ...) : NAs introduced by coercion
2: In discrete(x, ...) : NAs introduced by coercion

summary
   user mode
1    A   NA
2    B   NA

Whereas, I need this:

user  mode
A     blue
B     green

What am I doing wrong? I tried using other methods, as well as just mlv(x=color). According to the help pages of modeest, it should work for factors.

I don't want to use table(), as I need a simple function that I can use to create a summary table like in this question: "How to get the mode of a group in summarize in R", but for a categorical column.

r

Maybe also relevant: "Is there a built-in function for finding the mode?" – Jaap

3

votes

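You should try table(). For instance, which.max(table(color)) returns the position of the most frequent value, named by that value; wrap it in names() to get the value itself as a string.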
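A minimal base R sketch of the table() approach applied per user, using the data from the question. The helper name table_mode is hypothetical, and stringsAsFactors = TRUE is set explicitly only to mirror the factor column in the question.

```r
user  <- c("A", "B", "A", "A", "B", "A", "B", "B")
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")
df <- data.frame(user, color, stringsAsFactors = TRUE)

# Hypothetical helper: label of the most frequent value.
# On ties, which.max() keeps the first maximum in table order.
table_mode <- function(x) names(which.max(table(x)))

# Mode of the whole column
overall <- table_mode(df$color)

# Mode within each user, without any extra packages
by_user <- tapply(df$color, df$user, table_mode)

print(overall)
print(by_user)
```

tapply() returns a named vector here, so by_user[["A"]] gives the mode for user A.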

1

votes

The reason modeest::mlv.factor() does not work might actually be a bug in the package.

In the function mlv.factor() the function modeest:::discrete() is called. In there, this is what happens:

f <- factor(color)
[1] blue   green  blue   blue   green  yellow pink   blue  
Levels: blue green pink yellow

tf <- tabulate(f)
[1] 4 2 1 1

as.numeric(levels(f)[tf == max(tf)])
[1] NA
Warning message:
NAs introduced by coercion

This is what is returned to mlv.factor(). But levels(f)[tf == max(tf)] equals "blue", which as.numeric() cannot convert to a number, so the result is NA.
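The coercion problem can be reproduced without modeest. The sketch below only imitates the steps quoted above; it is not the package's source, and the second factor g is made-up data to show that the same steps would work for numeric-looking labels.

```r
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")
f  <- factor(color)

# Count occurrences per level (nbins given explicitly to match the levels)
tf <- tabulate(as.integer(f), nbins = nlevels(f))

# The most frequent level is a text label ...
top <- levels(f)[tf == max(tf)]

# ... so forcing it to numeric yields NA (warning suppressed for the demo)
num <- suppressWarnings(as.numeric(top))

# Made-up factor whose labels look like numbers: the coercion succeeds
g  <- factor(c("3", "3", "7"))
tg <- tabulate(as.integer(g), nbins = nlevels(g))
numeric_top <- as.numeric(levels(g)[tg == max(tg)])

print(top)
print(num)
print(numeric_top)
```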

You can calculate the mode by finding the unique values and counting how many times each one appears in the vector. You can then subset the unique values for the one that appears most often (i.e. the mode).

Find unique colours:

unique_colors <- unique(color)

match(color, unique_colors) returns, for each element of color, its position in unique_colors. tabulate() then counts the number of times each position, and therefore each color, occurs. which.max() returns the index of the most frequently occurring value. This index can then be used to subset the unique colors.

unique_colors[which.max(tabulate(match(color, unique_colors)))]

Perhaps more readable using the pipe operator that dplyr provides:

library(dplyr)
unique(color)[color %>%
                match(unique(color)) %>% 
                tabulate() %>%
                which.max()]

Both options return:

[1] blue
Levels: blue green pink yellow
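To make the match/tabulate/which.max chain easier to follow, here is a sketch that stores each intermediate result for the character vector from the question. The variable names positions, counts and winner are only illustrative.

```r
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")

# Distinct values in order of first appearance:
# "blue" "green" "yellow" "pink"
unique_colors <- unique(color)

# Position of each element within unique_colors: 1 2 1 1 2 3 4 1
positions <- match(color, unique_colors)

# How often each position occurs: 4 2 1 1
counts <- tabulate(positions, nbins = length(unique_colors))

# Index of the largest count (first one on ties): 1
winner <- which.max(counts)

# The mode itself
mode_value <- unique_colors[winner]
print(mode_value)
```

Because color is a character vector here, the result is a plain string. With the factor column df$color the same expression returns a factor, which is why the output above also lists the levels.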

EDIT:

The best way is probably to create your own mode-function:

calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

and then use it in dplyr::summarise():

library(dplyr)

df %>% 
  group_by(user) %>% 
  summarise(color = calculate_mode(color))

Which returns:

# A tibble: 2 x 2
    user  color
  <fctr> <fctr>
1      A   blue
2      B  green
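One behaviour worth knowing about calculate_mode(): when several values are tied for most frequent, which.max() keeps only the first one encountered. The sketch below illustrates this with a made-up tied vector and a hypothetical variant, calculate_all_modes(), that returns every tied value instead.

```r
calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Hypothetical variant: return all values that share the highest count
calculate_all_modes <- function(x) {
  uniqx  <- unique(x)
  counts <- tabulate(match(x, uniqx), nbins = length(uniqx))
  uniqx[counts == max(counts)]
}

# Made-up example with a tie between "green" and "blue"
tied <- c("green", "blue", "blue", "green")

first_mode <- calculate_mode(tied)       # first value seen among the tied ones
all_modes  <- calculate_all_modes(tied)  # every tied value

print(first_mode)
print(all_modes)
```

The variant can return more than one value, so it does not fit directly into a summarise() call that expects one value per group.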

0

votes

Solution with dplyr and purrr

You can use a more generalized version of the correct answer by @loudelouk, like this:

df %>% 
  group_by(user) %>% 
  select_if(is.factor) %>% 
  summarise_all(function(x) { x %>% table %>% which.max %>% names })

or shorter:

df %>% 
  group_by(user) %>% 
  summarise_if(is.factor, .funs = function(x) { x %>% table %>% which.max %>% names})
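In dplyr 1.0.0 and later, the scoped verbs such as summarise_if() are superseded by across(). A sketch of the same idea in that style follows, assuming a dplyr version of at least 1.0.0 is installed; the helper name table_mode is hypothetical.

```r
library(dplyr)

user  <- c("A", "B", "A", "A", "B", "A", "B", "B")
color <- c("blue", "green", "blue", "blue", "green", "yellow", "pink", "blue")
df <- data.frame(user, color = factor(color))

# Hypothetical helper: label of the most frequent value
table_mode <- function(x) names(which.max(table(x)))

# Apply the helper to every factor column within each user;
# the grouping column itself is left out by across()
result <- df %>%
  group_by(user) %>%
  summarise(across(where(is.factor), table_mode))

print(result)
```

Note that table_mode() returns a character string, so the summarised color column is character rather than factor.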
