0
votes

I have a dataframe in R for which one column has multiple variables. The variables either start with ABC, DEF, GHI. Those variables are followed by a series of 6 numbers (ie ABC052689, ABC062895, DEF045158).

For each row, i would like to pull one instance of ABC (the one with the largest number).

If the row has ABC052689, ABC062895, DEF045158, I would like it to pull out ABC062895 because it is greater than ABC052689.

I would then want to do the same for the variable that starts with DEF######.

I have managed to filter the data to have rows where ABC is there and either DEF or GHI is there:

library(tidyverse)
data_with_ABC <- test %>% 
  filter(str_detect(car,"ABC"))

data_with_ABC_and_DEF_or_GHI <- data_with_ABC %>% 
  filter(str_detect(car, "DEF") | str_detect(car, "GHI"))

I don't know how to pull out let's say ABC with the greatest number

ABC052689, ABC062895, DEF045158 -> ABC062895
2
To be clear: dataframe test contains one column, car, where each row of that column is the comma-separated string?neilfws
Yes @neilfws you are correct. dataframe 'test' contains one column 'car' where each row contains a comma-separated stringrdavis

2 Answers

0
votes

For a base R solution, we can try using lapply along with strsplit to identify the greatest ABC plate in each CSV string, in each row.

df <- data.frame(car=c("ABC052689,ABC062895,DEF045158"), id=c(1),
    stringsAsFactors=FALSE)
df$largest <- lapply(df$car, function(x) {
    cars <- strsplit(x, ",", fixed=TRUE)[[1]]
    cars <- cars[substr(cars, 1, 3) == "ABC"]
    max <- cars[which.max(substr(cars, 4, 9))]
    return(max)
})
df

                            car id   largest
1 ABC052689,ABC062895,DEF045158  1 ABC062895

Note that we don't need to worry about casting the substring of the plate number, because it is fixed width text. This means that it should sort properly even as text.

0
votes

Besides Tim's answer, if you want to do all ABC/DEF at one time, following code may help with library(tidyverse):

> df <- data.frame(car=c("ABC052689", "ABC062895", "DEF045158", "DEF192345"),                  stringsAsFactors=FALSE)
> 
> df2 = df %>% 
+     mutate(state = str_sub(car, 1, 3), plate = str_sub(car, 4, 9))
> 
> df2
        car state  plate
1 ABC052689   ABC 052689
2 ABC062895   ABC 062895
3 DEF045158   DEF 045158
4 DEF192345   DEF 192345
> 
> df2 %>% 
+   group_by(state) %>% 
+   summarise(maxplate = max(plate)) %>% 
+   mutate(full = str_c(state, maxplate))
# A tibble: 2 x 3
  state maxplate full     
  <chr> <chr>    <chr>    
1 ABC   062895   ABC062895
2 DEF   192345   DEF192345