Pull out only two variables from a column

Question

I have a dataframe in R in which one column holds multiple variables per row. The variables start with either ABC, DEF, or GHI and are followed by a series of six digits (e.g. ABC052689, ABC062895, DEF045158).

For each row, I would like to pull one instance of ABC (the one with the largest number).

If the row has ABC052689, ABC062895, DEF045158, I would like it to pull out ABC062895 because it is greater than ABC052689.

I would then want to do the same for the variable that starts with DEF######.

I have managed to filter the data to have rows where ABC is there and either DEF or GHI is there:

library(tidyverse)
data_with_ABC <- test %>% 
  filter(str_detect(car,"ABC"))

data_with_ABC_and_DEF_or_GHI <- data_with_ABC %>% 
  filter(str_detect(car, "DEF") | str_detect(car, "GHI"))

I don't know how to pull out, let's say, the ABC with the greatest number:

ABC052689, ABC062895, DEF045158 -> ABC062895

To be clear: dataframe test contains one column, car, where each row of that column is the comma-separated string? — neilfws
Yes @neilfws you are correct. dataframe 'test' contains one column 'car' where each row contains a comma-separated string — rdavis

Tim Biegeleisen · Accepted Answer · 2019-04-09T01:37:41

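For a base R solution, we can use vapply along with strsplit to find the greatest ABC plate in each row's comma-separated string. Using vapply rather than lapply keeps the new column a plain character vector instead of a list.

df <- data.frame(car=c("ABC052689,ABC062895,DEF045158"), id=c(1),
    stringsAsFactors=FALSE)
df$largest <- vapply(df$car, function(x) {
    # split the string on commas; trimws also handles ", " separators
    cars <- trimws(strsplit(x, ",", fixed=TRUE)[[1]])
    cars <- cars[substr(cars, 1, 3) == "ABC"]       # keep only the ABC plates
    if (length(cars) == 0) return(NA_character_)     # row has no ABC plate
    cars[which.max(substr(cars, 4, 9))]             # plate with the largest number
}, character(1), USE.NAMES=FALSE)
df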

                            car id   largest
1 ABC052689,ABC062895,DEF045158  1 ABC062895

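Note that we don't need to cast the numeric part of the plate explicitly. which.max coerces the digit strings to numbers internally, and because the numeric part is fixed-width text (six zero-padded digits), it would sort in the same order even when compared as text.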
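
The question also asks for the same result for DEF. One way to cover both prefixes is to wrap the logic in a function. The following is only a sketch: the helper name largest_with_prefix and the second sample row are made up for illustration, and it assumes the car column contains no NA values. It converts the numeric part with as.integer explicitly, so it does not depend on the fixed-width property.

```r
# Hypothetical helper (not part of the original answer): for each
# comma-separated string, return the plate with the given prefix that
# has the largest numeric part, or NA if no plate has that prefix.
largest_with_prefix <- function(strings, prefix) {
  vapply(strings, function(x) {
    # Split on commas and strip any surrounding whitespace
    plates <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
    plates <- plates[startsWith(plates, prefix)]
    if (length(plates) == 0) return(NA_character_)
    # Compare the numeric part explicitly as integers
    numbers <- as.integer(substring(plates, nchar(prefix) + 1))
    plates[which.max(numbers)]
  }, character(1), USE.NAMES = FALSE)
}

# Sample data: the first row is from the question, the second is invented
test <- data.frame(
  car = c("ABC052689, ABC062895, DEF045158",
          "ABC000001, GHI123456"),
  stringsAsFactors = FALSE
)
test$largest_ABC <- largest_with_prefix(test$car, "ABC")
test$largest_DEF <- largest_with_prefix(test$car, "DEF")
test
```

Rows without a matching plate get NA in the corresponding column, so the earlier filter step on ABC, DEF, or GHI becomes optional.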
