I am dealing with a problem where I need to map the numbers in a column to their corresponding character intervals. Examples of the intervals and the original values are shown below:

VehicleDriverCarrierPremium_Interval<-c("(Null)",">= 0, <100",">= 100, < 200",">= 200, < 300",">= 300, < 400",">= 400, < 500",">= 500, < 600",">= 600, < 700",">= 700, < 800",">= 800, < 900")
VehicleDriverCarrierPremium<-c(423,12,NA,535,231,875)

What I want in the end looks like this:

VehicleDriverCarrierPremium
[1] ">= 400, < 500" ">= 0, <100" "(Null)" ">= 500, < 600" ">= 200, < 300" ">= 800, < 900"

The problem is that the original values range from 0 to 50000, and the interval levels do not follow a fixed pattern: the length of the intervals changes as the values get larger. There is also a comma as a thousands separator when a bound is 1,000 or greater. For example, the last two intervals are:

">= 9,000, <10,000", ">= 10,000, <50,000"

What I have done so far is very manual: I divide the intervals into several groups and use if and for statements to convert the original values to their corresponding intervals. But whenever the interval levels or lengths change, I have to update the code manually.

So I am wondering if there is a better way that first reads the interval levels, which are of type character, and then replaces each original value with the interval it falls into.

Please let me know if you need any more information. Thank you!

2 Answers


Have you checked whether the cut function does what you want?

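cut(VehicleDriverCarrierPremium, breaks = seq(0, 10000, by = 100), right = FALSE)

Note right = FALSE: by default cut builds intervals that are closed on the right, (a, b], whereas yours are closed on the left, [a, b). I did not use the labels parameter, but I believe you could get the exact labels you want with it and avoid regular expressions.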
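As a minimal sketch of that idea, assuming the labels are already sorted in increasing order and line up with evenly spaced breaks (true for the question's first nine intervals, not for the full 0 to 50,000 range), the labels can be passed straight to cut. The names breaks, labels, and result are only illustrative.

```r
# Example data from the question
VehicleDriverCarrierPremium <- c(423, 12, NA, 535, 231, 875)

# Interval labels in increasing order (assumed to match the breaks below)
labels <- c(">= 0, <100", ">= 100, < 200", ">= 200, < 300", ">= 300, < 400",
            ">= 400, < 500", ">= 500, < 600", ">= 600, < 700",
            ">= 700, < 800", ">= 800, < 900")

# One more break than labels; right = FALSE makes each interval [lower, upper)
breaks <- seq(0, 900, by = 100)

result <- as.character(cut(VehicleDriverCarrierPremium, breaks = breaks,
                           labels = labels, right = FALSE))

# Missing inputs get the "(Null)" level used in the question;
# values outside every interval stay NA
result[is.na(VehicleDriverCarrierPremium)] <- "(Null)"
result
```

For uneven intervals, breaks would have to list every boundary explicitly instead of using seq.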


OK, here is a different approach. I am quite sure there are easier and more efficient ways. I am using tidyverse to transform your character intervals into two columns, begin and end.

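library(tidyverse)

# Split each interval label into numeric lower (begin) and upper (end) bounds
intervals <- tibble(int_ID = c(">= 0, <100",
                               ">= 100, <200",
                               ">= 200, <1,000",
                               ">= 1,000, <2,000",
                               ">= 2,000, <3,000",
                               ">= 3,000, <5,000",
                               ">= 5,000, <50,000")) %>%
  separate(int_ID, into = c("begin", "end"), sep = ", ", remove = FALSE) %>%
  mutate(begin = str_sub(begin, 4)) %>%   # drop the leading ">= "
  mutate(end = str_sub(end, 2)) %>%       # drop the leading "<"
  # strip thousands separators and stray spaces before converting
  mutate(across(c(begin, end), ~ as.integer(str_remove_all(.x, "[, ]"))))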

VehicleDriverCarrierPremium <- c(423, 12, NA, 535, 231, 875, 9000)

# Pre-allocate the result so values that match no interval stay NA
VehicleDriverCarrierPremium_factor <- rep(NA_character_, length(VehicleDriverCarrierPremium))
for (i in seq_along(VehicleDriverCarrierPremium)) { # for each element
  value <- VehicleDriverCarrierPremium[i]
  if (is.na(value)) {
    VehicleDriverCarrierPremium_factor[i] <- "(Null)"
    next
  }
  for (j in seq_along(intervals$int_ID)) { # test which interval the value falls into
    if (value >= intervals$begin[j] && value < intervals$end[j]) {
      VehicleDriverCarrierPremium_factor[i] <- intervals$int_ID[j]
      break
    }
  }
}
print(VehicleDriverCarrierPremium_factor)

It might take a while if you have tens of thousands of values to categorize and hundreds of intervals. Even with this code, we could do a lot better in terms of performance if you need it.

Hope it is what you wanted.
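One possible way to get that speed-up, sketched here in base R rather than taken from the code above, is to parse the bounds once and let findInterval do the lookup for all values at once. This assumes the intervals are contiguous, non-overlapping, and have integer bounds; the names parts, breaks, idx, and result are illustrative.

```r
# Interval labels as character strings (same examples as above)
int_ID <- c(">= 0, <100", ">= 100, <200", ">= 200, <1,000",
            ">= 1,000, <2,000", ">= 2,000, <3,000",
            ">= 3,000, <5,000", ">= 5,000, <50,000")
VehicleDriverCarrierPremium <- c(423, 12, NA, 535, 231, 875, 9000)

# Split each label at the ", " between the two bounds, then keep only digits
parts <- strsplit(int_ID, ", ", fixed = TRUE)
begin <- as.numeric(gsub("[^0-9]", "", sapply(parts, `[`, 1)))
end   <- as.numeric(gsub("[^0-9]", "", sapply(parts, `[`, 2)))

# Sort by lower bound; assumes contiguous, non-overlapping intervals
ord <- order(begin)
breaks <- c(begin[ord], max(end))

# For each value, findInterval() gives the index i with breaks[i] <= x < breaks[i + 1]
idx <- findInterval(VehicleDriverCarrierPremium, breaks)
idx[!is.na(idx) & (idx < 1 | idx > length(int_ID))] <- NA  # outside all intervals

result <- int_ID[ord][idx]
result[is.na(VehicleDriverCarrierPremium)] <- "(Null)"
result
```

Because the lookup is vectorized, there is no explicit loop over values or intervals; values that fall outside every interval come back as NA rather than being dropped.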

Tom