
I have a factor containing many industry names that I need to collapse into major categories. Because respondents could type whatever they wanted, I have an inflated number of levels (e.g. financial services, Financial Services, Banking, Finance). Since these strings don't match exactly, each one becomes its own level, so I'm trying to collapse them with forcats:

library(forcats)

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking",
  "Corporate Finance", "Finance", "Financial", "financial services",
  "financial services", "Financial Services", "Financial services"),
  NULL = "H")

I get a warning that says "Financial services" is unknown. This is extremely frustrating because when I print the vector, I can see that the level exists. I've tried copying and pasting the exact words from the output and retyping them by hand, and it seems like there are hidden characters that prevent the level from being matched.

How do I properly collapse these values?

-> test$industry
Banking
Corporate Finance 
Finance Financial 
financial services
financial services 
Financial Services 
Financial services

When I try to "revalue", say, the last level, "Financial services", it tells me it's an unknown string.

EDIT: Output of dput(x$industry):

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

EDIT: Figured it out. Some of the levels had a trailing space. For example, when I printed Prescreen$Industry, it returned names like "Banking" and "Corporate Finance", but nothing showed that there was a space after the text. Banking was actually "Banking ", with a trailing space that wasn't visible in the printed output. How can I make this kind of whitespace visible and make sure it doesn't happen again?

Can I run a string-length function (something like len) on a column? If so, how does that work? Something like PrescreenF$Industry("Banking")?
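For reference, a minimal sketch of one way to check string lengths and spot padded levels, using only base R. The `industry` factor here is a made-up sample standing in for the real column, not the actual survey data.

```r
# Hypothetical sample mirroring the kind of levels described above
industry <- factor(c("Banking ", "Finance", "Financial ", "Financial services "))

lv <- levels(industry)

# nchar() is R's string-length function; "Banking " has 8 characters, not 7
data.frame(level = lv, n_chars = nchar(lv))

# Flag levels that start or end with whitespace
has_padding <- grepl("^\\s|\\s$", lv)
lv[has_padding]

# Printing the character vector with quotes makes a trailing space visible
print(lv, quote = TRUE)
```

Printing `levels(x)` rather than the factor itself shows each level in quotes, which is usually enough to notice a trailing space.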

Please share a reproducible example of your data so we can troubleshoot this. - Richard Lusch
If there are hidden characters, they are probably whitespace. stringr::str_trim could help, but you'd have to convert the factor to character first, then back to factor. - shea
Can you post the output of dput(test$industry) or of dput(head(test, 20))? - Rui Barradas
@RuiBarradas Just added a new edit section. - D500
Did my answer work for you? - shea

1 Answer


If "x" is your data frame:

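library(stringr)

# Convert to character so the strings themselves can be edited
x$industry <- as.character(x$industry)
# Strip leading and trailing whitespace
x$industry <- str_trim(x$industry)
# Rebuild the factor; levels that are now identical merge automatically
x$industry <- as.factor(x$industry)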

Then you can get back to fct_collapse() to simplify your factors.
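As an alternative sketch, the levels can also be cleaned in place without converting to character, since assigning duplicate names to `levels()` merges those levels. The sample factor and the choice to lower-case everything are assumptions for illustration, not part of the original data or answer.

```r
# Hypothetical sample with padded and inconsistently cased levels
industry <- factor(c("Banking ", "Finance", "Financial ", "financial services",
                     "Financial services ", "Technology", "Technology "))

# trimws() is base R; duplicate level names produced here are merged
levels(industry) <- trimws(levels(industry))

# Optional assumption: fold case so "Financial services" matches "financial services"
levels(industry) <- tolower(levels(industry))

# Collapse the cleaned finance-related levels into one category
finance <- c("banking", "finance", "financial", "financial services")
levels(industry)[levels(industry) %in% finance] <- "Finance"

table(industry)
```

Once the levels are trimmed, the same grouping could be done with fct_collapse() as in the question; the base R assignment is used here only to keep the sketch free of extra packages.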