I use factors somewhat infrequently and generally find them comprehensible, but I often am fuzzy about the details for specific operations. Currently, I am coding/collapsing categories with few observations into "other" and am looking for a quick way to do that--I have a perhaps 20 levels of a variable, but am interested in collapsing a bunch of them to one.
data <- data.frame(employees = sample.int(1000,500),
naics = sample(c('621111','621112','621210','621310','621320','621330','621340','621391','621399','621410','621420','621491','621492','621493','621498','621511','621512','621610','621910','621991','621999'),
100, replace=T))
Here are my levels of interest, and their labels in separate vectors.
#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
'621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
'Offices of dentists',
'Offices of all other miscellaneous health practitioners',
'Home health care services',
'Offices of Mental Health Practitioners',
'Offices of chiropractors',
'Medical Laboratories',
'Outpatient Mental Health and Substance Abuse Centers',
'Offices of optometrists')
I could use the factor()
call, enumerate them all, classifying as "other" for each time a category had few observations.
Assuming that the top8
and top8_desc
above are the actual top 8, what is the best way to declare data$naics
as a factor variable so that the values in top8
are correcly coded and everything else is recoded as other
?