4
votes

When loading data, R converts character strings to factors unless told otherwise. We then have to convert the factors to character or numeric, depending on the underlying data. For numeric values, we first convert to a character string using as.character() and then convert the result with as.integer() (in the case of integer values).

But when I strip non-numeric characters from a number using gsub, R automatically converts the cleaned-up factor column to character.

For example:

> sal <- data.frame(name = c('abc','def','ghi','pqr'),
+                   Salary = c('$65,000','$102,000','$85,000','$72,000'))
> str(sal)
'data.frame':   4 obs. of  2 variables:
 $ name  : Factor w/ 4 levels "abc","def","ghi",..: 1 2 3 4
 $ Salary: Factor w/ 4 levels "$102,000","$65,000",..: 2 1 4 3
> sal$Salary <- gsub('\\$','',sal$Salary)
> sal$Salary <- gsub(',','',sal$Salary)
> str(sal)
'data.frame':   4 obs. of  2 variables:
 $ name  : Factor w/ 4 levels "abc","def","ghi",..: 1 2 3 4
 $ Salary: chr  "65000" "102000" "85000" "72000"
> 

We can see that the 'Salary' column changes from factor to character after gsub. Does gsub also perform an as.character() conversion here? If so, why does it not go on to convert the column to integer, given that all the values are integers?

3 Answers

2
votes

Yes, gsub performs as.character. If you type gsub in the console, you can see the function definition:

function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
fixed = FALSE, useBytes = FALSE) 
{
    if (!is.character(x)) 
        x <- as.character(x)
    .Internal(gsub(as.character(pattern), as.character(replacement), 
         x, ignore.case, perl, fixed, useBytes))
}

And no, it will not convert to integer, because it always returns a character vector. From ?gsub:

sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).
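
To get integers, the conversion has to be an explicit extra step after gsub. A minimal sketch, assuming a standalone factor named salary (an illustrative name, built with factor() so that it behaves the same regardless of the stringsAsFactors default):

```r
# Recreate the example column as a factor.
salary <- factor(c("$65,000", "$102,000", "$85,000", "$72,000"))

# gsub() coerces the factor to character and returns a character vector.
# Inside a character class, "$" is a literal dollar sign.
cleaned <- gsub("[$,]", "", salary)
class(cleaned)

# The conversion to integer is a separate, explicit step.
salary_int <- as.integer(cleaned)
class(salary_int)
salary_int
```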

2
votes

You can change the levels of your factor directly, since the levels are character strings:

sal <- data.frame(name = c('abc','def','ghi','pqr'),
              Salary = c('$65,000','$102,000','$85,000','$72,000'))


levels(sal$Salary) <- gsub('\\$|,', '', levels(sal$Salary))
str(sal)


> 'data.frame': 4 obs. of  2 variables:
 $ name  : Factor w/ 4 levels "abc","def","ghi",..: 1 2 3 4
 $ Salary: Factor w/ 4 levels "102000","65000",..: 2 1 4 3
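
Note that the column is still a factor after relabeling. If integers are the goal, one possible follow-up (a sketch, with stringsAsFactors = TRUE set explicitly as an assumption so the column starts out as a factor) is to go through as.character() first:

```r
sal <- data.frame(name = c("abc", "def", "ghi", "pqr"),
                  Salary = c("$65,000", "$102,000", "$85,000", "$72,000"),
                  stringsAsFactors = TRUE)

# Clean the level labels; the column remains a factor.
levels(sal$Salary) <- gsub("\\$|,", "", levels(sal$Salary))

# Convert labels (not the internal codes) to integers.
sal$Salary <- as.integer(as.character(sal$Salary))
str(sal)
```
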
0
votes

You appear to be asking a “why” question. The answer is almost certainly that the result has to be character rather than factor, because the levels of a factor are attributes rather than the actual values. The values of a factor variable are NOT the ones you see in the str output; they are integers starting at 1. The first item, “65000”, would have had a value of 2 but would be displayed as 65000.

So you were correct that the value was an integer, but not the value you thought it was. The second item would have had a value of 1 because its level would have had the lowest lexical order, despite ending up as the highest numeric value once it was converted to numeric.
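
The difference between the internal codes and the displayed labels can be checked directly. A small sketch, using a standalone factor named salary (an illustrative name) with the already-cleaned values:

```r
salary <- factor(c("65000", "102000", "85000", "72000"))

# Levels are sorted as strings, so "102000" comes first.
levels(salary)

# as.integer() on a factor returns the internal codes: 2 1 4 3
codes <- as.integer(salary)
codes

# Going through as.character() returns the numbers shown in the labels.
values <- as.integer(as.character(salary))
values
```

This is why calling as.integer() directly on a factor column silently gives the wrong numbers.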