I have a data.table with a column that lists the harmonized tariff codes for the goods being shipped. There are some input issues: sometimes a row has repeated codes, e.g. "7601.00; 7601.00", and sometimes it has different codes, e.g. "7601.00; 8800.00". I have not decided what to do with differing entries yet, but the first thing I want to do is get rid of the duplicates. So I wrote a user-defined function:
library(stringr) # Provides str_replace_all

unique_hscodes <- function(hs_input){
  new <- strsplit(hs_input, split = ";") # Delimiter ;
  new <- lapply(new, str_replace_all, " ", "") # Strip spaces
  if (length(unique(unlist(new))) == 1) { # Unique HS code
    return(unique(unlist(new)))
  }
  else {
    new <- names(sort(table(unlist(new)), decreasing = TRUE)[1]) # Most frequent
    return(new)
  }
}
When I run DT[, hs_code := unique_hscodes(hscode)], I get a data.table whose hs_code column contains the same code in every row. But when I run DT[, hs_code := unique_hscodes(hscode), by = 1:nrow(DT)], each row gets its own correct code.
Can someone please explain what is going on here?
Does DT[, hs_code := sapply(hscode, unique_hscodes)] work? – Jaap
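A minimal sketch of what is probably going on, using made-up data since the real DT is not shown. Without by, the j expression is evaluated once and the function receives the whole hscode column as a single character vector. unlist() then pools the codes from every row, so the function returns one value (the most frequent code overall), and := recycles that value down the column. With by = 1:nrow(DT), or with an element-wise call as in the comment above, the function sees one row at a time. The helper most_frequent_code below is a hypothetical base-R stand-in for unique_hscodes (gsub instead of str_replace_all). It should give the same result, because the most frequent code is also the only code when there is just one.

```r
library(data.table)

# Hypothetical toy data; the real DT is not shown in the question.
DT <- data.table(hscode = c("7601.00; 7601.00",
                            "8800.00; 8800.00",
                            "7601.00; 7601.00; 8800.00"))

# Stand-in for unique_hscodes: split on ";", strip spaces,
# and return the most frequent code among ALL strings passed in.
most_frequent_code <- function(hs_input) {
  codes <- gsub(" ", "", unlist(strsplit(hs_input, split = ";")), fixed = TRUE)
  names(sort(table(codes), decreasing = TRUE))[1]
}

# No `by`: the function gets the whole column, pools every row's codes,
# and returns a single value that := recycles to all rows.
DT[, pooled_code := most_frequent_code(hscode)]

# Element-wise: the function is called once per row, so each row
# gets its own most frequent code.
DT[, per_row_code := vapply(hscode, most_frequent_code, character(1),
                            USE.NAMES = FALSE)]

print(DT)
```

On this toy data, pooled_code is "7601.00" in every row, while per_row_code is "7601.00", "8800.00", "7601.00". vapply is used here instead of sapply only because it guarantees a character vector of the right length; by = 1:nrow(DT) should give the same per-row result.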