0
votes

Data:

DB1 <- data.frame(orderItemID  = 1:10,     
price = c("12.90","8.90","Mrz 40","79.95","Dez 45", "7.99","Jun 90","129.90","Jul 90","49.95")  

Expected Outcome:

   DB1 <- data.frame(orderItemID  = 1:10,     
 price = c("12.90","8.90","3.40","79.95","12.45", "7.99","6.90","129.90","7.90","49.95") 

Hey guys, it´s me again ;) and unfortunately I have a pretty difficult problem in my data set I need to solve... As you can see above I have some correct an some incorrect prices. At the incorrect prices there are always letters instead of a numbers before the decimal point (and theres no decimal point at the wrong prices). The 3 letters are acronyms for months of the year. So for example Dez is 12. month of the year, so the correct number is 12. So that Dez 45 should become 12.45. 2.example: Jun is the 6.month of the year so the correct number is 6. So that Jun 90 should become 6.90. (Hope its clear what I mean)

So thats what I want to transform

Jan=1.

Feb=2.

Mrz=3.

Apr=4.

Mai=5.

Jun=6.

Jul=7.

Aug=8.

Sep=9.

Okt=10.

Nov=11.

Dez=12.

I really have no idea this time how to solve this problem...

Hope somebody got an idea

3
Dez or Dec? and Mar or Mrz? and Okt or Oct? - user227710
Dez, Mrz Okt: it´s german ;) - Jarvis
Wow- so many good answers in such a short time: love you guys! :) - Jarvis

3 Answers

4
votes

if you are using first three letters of each month

library(qdap) # mgsub
DB1$price<-mgsub(month.abb,1:12,DB1$price) 
#month.abb from baseR give abbreviated months

if you are sticking to your own month abbreviations:

month_abb <-c("Jan","Feb","Mrz","Apr","Mai","Jun","Jul","Aug","Sep","Okt","Nov","Dez")
DB1$price<-mgsub(month_abb,1:12,DB1$price)
3
votes

Here's one way using regular expressions. First, make sure the price column is a character vector and not a factor

DB1$price<-as.character(DB1$price)

Then define your desired replacements

replacewith<-c("Jan"="1.", "Feb"="2.", "Mrz"="3.", "Apr"="4.", "Mai"="5.", "Jun"="6.", "Jul"="7.", "Aug"="8.", "Sep"="9.", "Okt"="10.", "Nov"="11.", "Dez"="12.")

Now turn those into a regular erxpression and match against your price column

re <- paste0("^(",paste(names(replacewith),collapse="|"), ") ")
m <- regexpr(re,DB1$price, perl=T)
mm <- regmatches(DB1$price, m)

Now we do a look up for the replacement

regmatches(DB1$price, m) <- replacewith[substr(mm, 1, nchar(mm)-1)]
DB1$price
#  [1] "12.90"  "8.90"   "3.40"   "79.95"  "12.45"  "7.99"   "6.90"   "129.90" "7.90"   "49.95" 
1
votes

My solution works under the assumption that the "wrong prices" always contain a number with two digts.

Let me start by constructing your sample data set again:

DB1 <- data.frame(orderItemID  = 1:10,     
              price = c("12.90","8.90","Mrz 40","79.95","Dez 45", "7.99","Jun 90","129.90","Jul 90","49.95"),
              stringsAsFactors=FALSE) 

I added the argument stringsAsFactors=FALSE, which is important for the solution to work. It makes sure that prices are actually strings and not factor variables.

The solution then works in two steps. First, I simply convert the prices to numeric:

price_num <- as.numeric(DB1$price)

This converts the wrong prices to NA, which makes them easy to find by

wrong_prices <- is.na(price_num)

Next, you have to define the abbreviations for the months that are actually used:

m_abb <- c("Jan","Feb","Mrz","Apr","Mai","Jun","Jul","Aug","Sep","Okt","Nov","Dez")

There is the built-in variable month.abb. If it matches the month abbreviations in your data set, you can use that one, of course. This depends on your locale settings and in my case, I hade to redefine the variable manually.

No I define the function that converts the wrong prices to numeric:

convert_wrong_prices <- function(wp) {
   wp_split <- strsplit(wp," ")
   convert_wps <- function(wps) {
      match(wps[1],m_abb) + as.numeric(wps[2])/100
   }
   wp_converted <- sapply(wp_split,convert_wps)
   return (wp_converted)
}

And finally, it can be applied:

price_num[wrong_prices] <- convert_wrong_prices(DB1$price[wrong_prices])
DB1$price_num <- price_num