Create new categorical variable from existing variable

Question

I have a variable "PULocation" which is a set of integers from 1 to 265. Each number represents a unique location in New York City, and each location lies in one of the boroughs "Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island", or "Unknown". In my dataset, I only have the PULocation variable defined by integers, and I have separate information that tells me what each integer represents. I want to create a separate variable that defines the borough rather than the specific location. The issue is that the integers are not organized by borough; they are scattered. I've included the mapping below to show what I'm trying to explain.

I've tried this:

cab_sample$PUBorough <- ifelse(cab_sample$PULocationID == c(3,18,20,31,32,46,47,51,58,59,60,78, 81,94,119,126,136,147,159,167,168,169, 174,182,183,184,185,199,200,208,212,213, 220,235,240,241,242,247,248,250,254,259), "Bronx", "NOTHING")

But I get this warning message back:

Warning message: In cab_sample$PULocationID == c(3, 18, 20, 31, 32, 46, 47, 51, 58, : longer object length is not a multiple of shorter object length

Is there a way to do this mapping?

This is the mapping of each integer

I’m unsure what the question is. Does that code not do what you want? — Konrad Rudolph
No, that code doesn't work because the boroughs are not broken up like that in the integers; they are scattered. For example, location 2 is in the Bronx, location 3 is in Manhattan, then location 4 is back in the Bronx. So a break at certain points doesn't work. I need to pick the specific numbers for the boroughs, but I'm not sure of the code to do something like that. I'm sorry, I know I'm not being very articulate here; I'm just not totally sure how to word it — JareBear
Well then you’ll need to have a mapping from location code integer to borough name in some form. Otherwise the R code has no way of knowing what you mean, short of magic. In fact, forget about R code: a human couldn’t perform the translation either without knowing the mapping. — Konrad Rudolph
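
As a rough sketch of the mapping this comment calls for: the warning in the question comes from `==`, which compares the two vectors element by element and recycles the shorter one. `%in%` tests set membership instead, and a lookup table combined with `match()` handles all boroughs at once. The sample data frame and the `zone_lookup` table below are hypothetical, and the borough assignments in it are placeholders rather than the real mapping.

```r
# Hypothetical sample data; only a handful of location IDs are shown.
cab_sample <- data.frame(PULocationID = c(3, 18, 4, 265, 20))

# First few IDs from the "Bronx" list in the question (truncated here).
bronx_ids <- c(3, 18, 20, 31, 32)

# %in% checks each ID for membership in the set, so nothing is recycled.
cab_sample$PUBorough <- ifelse(cab_sample$PULocationID %in% bronx_ids,
                               "Bronx", "NOTHING")

# More general: one row per location ID, looked up with match().
# These borough assignments are illustrative placeholders only.
zone_lookup <- data.frame(
  LocationID = c(3, 4, 18, 20, 265),
  Borough    = c("Bronx", "Manhattan", "Bronx", "Bronx", "Unknown"),
  stringsAsFactors = FALSE
)

# match() returns the row of zone_lookup for each PULocationID
# (NA if an ID is missing from the table).
cab_sample$PUBorough2 <- zone_lookup$Borough[
  match(cab_sample$PULocationID, zone_lookup$LocationID)
]
```

With a complete lookup table of all 265 IDs, the `match()` version works regardless of how the IDs are scattered across boroughs.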

FloSchmo · Accepted Answer · 2018-11-06T16:38:00

Each label corresponds to an interval between two consecutive values in your breaks vector (for example, Manhattan is the interval 102-150).
Therefore you can use the findInterval function to check which interval (i.e. which borough) each integer of your PULocation vector falls in. Then you can index your labels vector with the interval indices returned by findInterval:

cab_data$borough <- labels[findInterval(cab_data$PULocation, breaks)]

And with dplyr:

cab_data %>% mutate(borough = labels[findInterval(PULocation, breaks)])

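The logical argument left.open controls whether the left break of each interval belongs to that interval. By default (left.open = FALSE) the intervals are closed on the left and open on the right; with left.open = TRUE they are open on the left and closed on the right.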
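
To make the effect of left.open concrete, here is a minimal sketch. The `breaks` and `labels` vectors are hypothetical stand-ins for the ones built from the asker's mapping; they are chosen only so that Manhattan covers 102-150 as in the example above.

```r
# Hypothetical breaks and labels; the real ones come from the asker's mapping.
breaks <- c(1, 50, 102, 151)
labels <- c("Bronx", "Brooklyn", "Manhattan", "Queens")

# A few IDs sitting on or next to the breaks.
ids <- c(49, 50, 102, 150, 151)

# Default: a value equal to a break falls in the interval that starts there.
idx_default <- findInterval(ids, breaks)

# left.open = TRUE: a value equal to a break falls in the interval that ends there.
idx_left_open <- findInterval(ids, breaks, left.open = TRUE)

# Index the labels with the interval numbers, as in the answer above.
borough_default   <- labels[idx_default]
borough_left_open <- labels[idx_left_open]
```

Note that findInterval returns 0 for values that fall before the first interval, and indexing labels with 0 silently drops that element, so the breaks should cover the full range of IDs.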
