5
votes

How can I create a new integer column recode that recodes the existing column y in the data frame df, using a dplyr approach?

# Generate random data
df <- data.frame(x = sample(1:100, 50), 
                 y = sample(LETTERS, 50, replace = TRUE), 
                 stringsAsFactors = FALSE)
# Structure of the data
str(df)
# 'data.frame': 50 obs. of  2 variables:
#  $ x: int  90 4 33 85 30 19 78 77 7 10 ...
#  $ y: chr  "N" "B" "P" "W" ...

# Convert the character vector to a factor
df$y <- factor(df$y)

# Structure of the data, to look at the effect of the factor conversion
str(df)
# 'data.frame': 50 obs. of  2 variables:
#  $ x: int  90 4 33 85 30 19 78 77 7 10 ...
#  $ y: Factor w/ 23 levels "A","B","C","E",..: 12 2 14 21 12 22 7 1 6 17 ...

# Collect the levels of the factor variable
labs <- levels(df$y)

# Recode the levels to sequential integers
recode <- seq_along(labs)

# Create the lookup data frame that pairs each level with its integer code
dfrecode <- data.frame(labs, recode)

# Mapping the recodes to the original data
df$recode <- dfrecode[match(df$y, dfrecode$labs), 'recode']

This code works as expected, but I want to replace it with dplyr or another more efficient approach. I can achieve the same using this approach if I know all the values, but I would like to do it without inspecting or explicitly listing the values present in the column.

1
dplyr::recode()? - RobertMyles
Wouldn't we need to explicitly list all the values present in order to use dplyr::recode()? - Prradep
Could fct_anon() from the forcats package be useful? - amarchin
Do you strictly need sequential integers? - amarchin
What about as.numeric(df$y)? - Nate

1 Answer

6
votes

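The trick here is that calling as.numeric() on a factor does not return the labels. It returns the factor's underlying integer codes, that is, the position of each value in levels(). Since factor() sorts the unique values to build the levels, this gives exactly the sequential recoding you want without listing any values. So, try this: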

df <- data.frame(x = sample(1:100, 50), 
                 y = sample(LETTERS, 50, replace = TRUE), 
                 stringsAsFactors = FALSE)
library(dplyr)
dfrecode <- df %>%
        mutate(recode = as.numeric(factor(y)))
str(dfrecode)
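
One caveat: as.numeric() yields a double (num) column, while the question asks for an integer column. A minimal sketch of the same idea with as.integer() follows, with a check that it reproduces the lookup-table result from the question. The set.seed() call and the names lookup and expected are assumptions added for illustration only and are not part of the original code.

```r
library(dplyr)

set.seed(42)  # assumed seed, only to make the sketch reproducible
df <- data.frame(x = sample(1:100, 50),
                 y = sample(LETTERS, 50, replace = TRUE),
                 stringsAsFactors = FALSE)

# as.integer() keeps the new column an integer; as.numeric() would give a double
dfrecode <- df %>%
  mutate(recode = as.integer(factor(y)))

# Rebuild the lookup-table approach from the question for comparison
labs <- levels(factor(df$y))
lookup <- data.frame(labs, recode = seq_along(labs))  # hypothetical helper table
expected <- lookup[match(df$y, lookup$labs), "recode"]

# Both approaches should produce the same integer codes
stopifnot(is.integer(dfrecode$recode),
          identical(dfrecode$recode, expected))
```

The codes follow the sorted order of the unique values in y, so the mapping depends on which values happen to be present in the data. If a fixed mapping is needed across data sets, passing an explicit levels argument to factor() would be the usual way to pin it down.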