
Brand new to R and stack. Hope I'm asking this question correctly.

I have numerous string variables that I need to recode into separate columns. The data were collected from a survey. For example, if a respondent selected "2-black" and "22-hispanic", the data are recorded in the variable "string" as "2;22".

I need to recode the variables into unique binary variables with colnames as: "Black", "White", "Hispanic", etc. The columns should be populated as "TRUE" or "FALSE" by searching for number patterns in the string value.

I tried writing a function using "grepl" but it's no good. First I had to create an object "string" from the data frame (code not included). Then I ran into problems distinguishing between, say, "2" and "22".
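To make the "2" versus "22" problem concrete, here is a small sketch of what grepl does with these values. The vector name `codes` and the anchored pattern are only illustrative assumptions, not part of my real data or function:

```r
codes <- c("22;2", "20", "40;20", "2")

# An unanchored pattern matches every string that contains the character "2"
grepl("2", codes)
# TRUE TRUE TRUE TRUE

# Requiring the code to sit between the start/end of the string or a ";"
# separator matches only a whole "2"
grepl("(^|;)2(;|$)", codes)
# TRUE FALSE FALSE TRUE
```

The second pattern assumes ";" is the only separator in the data.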

If you run the code below, you can see it's not working as intended:

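    strg_to_many <- function(newcol, string, number) {
      for (i in seq_along(number)) {
        # name of the column to create for this code
        string <- newcol[i]
        # grepl() matches substrings, so "2" also matches "22" and "20"
        df_temp[string] <- grepl(number[i], df_temp$string)
      }
      return(df_temp)
    }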

    df_temp<-data.frame(string=c("22;2", "20", "40,20", "2"))
    newcol<-c("black" , "white", "hispanic", "other")
    number<-c("2", "20", "22", "40")
    string<-c("22;2", "20", "40;20", "2")

    df <- strg_to_many(newcol, string, number)

The output I expect is:

    string black white hispanic other
    22;2    TRUE FALSE     TRUE FALSE
    20     FALSE  TRUE    FALSE FALSE
    40;20  FALSE  TRUE    FALSE  TRUE
    2       TRUE FALSE    FALSE FALSE

Thank you for any help!

What do you expect to happen to 40,20? Will that be Other == TRUE & white == TRUE? In the case of two numbers, how are they separated? In your example you seem to have both a semicolon and a comma. It would help if you were to provide the full expected output for the sample data you give (not just one row). - Maurits Evers
My mistake, sorry. They should be separated by a ";". - Kate McDonald

1 Answer


I'm not entirely clear on your expected output, but perhaps the following is what you're after.

The idea is to store the mapping between number and newcol in a data.frame and then perform a left_join after separating entries from string.

Note that this assumes that the first number in string is the number that pertains to newcol.

    library(tidyverse)

    # Lookup table mapping each numeric code to its column name
    df_map <- data.frame(
        number = number,
        newcol = newcol,
        stringsAsFactors = FALSE)

    df_temp %>%
        separate(string, c("x1", "x2"), remove = FALSE, fill = "right") %>%
        left_join(df_map, by = c("x1" = "number")) %>%
        mutate(val = TRUE) %>%
        spread(newcol, val, fill = FALSE) %>%
        select(-x1, -x2)
    #  string black hispanic other white
    #1      2  TRUE    FALSE FALSE FALSE
    #2     20 FALSE    FALSE FALSE  TRUE
    #3   22;2 FALSE     TRUE FALSE FALSE
    #4  40,20 FALSE    FALSE  TRUE FALSE

Update

In response to your clarifications, the following seems to reproduce your expected output:

    df_temp %>%
        rowid_to_column("row") %>%
        # split on ";" or "," so that every code gets its own row
        mutate(tmp = str_split(string, "[;,]")) %>%
        unnest(tmp) %>%
        left_join(df_map, by = c("tmp" = "number")) %>%
        mutate(val = TRUE) %>%
        select(-tmp) %>%
        spread(newcol, val, fill = FALSE) %>%
        select(-row)
    #  string black hispanic other white
    #1   22;2  TRUE     TRUE FALSE FALSE
    #2     20 FALSE    FALSE FALSE  TRUE
    #3  40,20 FALSE    FALSE  TRUE  TRUE
    #4      2  TRUE    FALSE FALSE FALSE
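
If you would rather avoid the tidyverse dependency, the same split-then-match idea can be sketched in base R. This is only a sketch under the assumption that codes are separated by ";" or "," and that number and newcol are aligned position by position, as in your sample data; the intermediate name `codes` is just illustrative.

```r
df_temp <- data.frame(string = c("22;2", "20", "40,20", "2"),
                      stringsAsFactors = FALSE)
newcol <- c("black", "white", "hispanic", "other")
number <- c("2", "20", "22", "40")

# Split every entry into its whole codes, e.g. "22;2" -> c("22", "2")
codes <- strsplit(df_temp$string, "[;,]")

# For each code, flag the rows whose split codes contain it exactly
for (i in seq_along(number)) {
  df_temp[[newcol[i]]] <- vapply(codes, function(x) number[i] %in% x, logical(1))
}

df_temp
#   string black white hispanic other
# 1   22;2  TRUE FALSE     TRUE FALSE
# 2     20 FALSE  TRUE    FALSE FALSE
# 3  40,20 FALSE  TRUE    FALSE  TRUE
# 4      2  TRUE FALSE    FALSE FALSE
```

Because %in% compares whole elements after splitting, "2" can no longer match "22" or "20", which is the problem the grepl approach ran into.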