1
votes

I have a UTF-16 Unicode Text (.txt) file downloaded and defaulted as comma-separated values (.csv) when saved on a mac drive. This file contains numeric data that has 1000 separator applied for numbers that are greater than 1,000. When loading in R, this data is in character class. In order to convert to numeric class, I do the following:

tx <- read.table("/Users/username/Desktop/report.csv",sep="\t", dec = ".", fileEncoding = "UTF-16LE", fill = T, skip=1 , quote="", header=T, stringsAsFactors = FALSE)

tx$Cost <- gsub("\\,", replacement = "", x = tx$Cost)

tx$Cost <- as.numeric(tx$Cost)
Warning message:
NAs introduced by coercion 

When summarized using the head(subset()) function, the following is the result where I'm still not able to convert into numeric class:

       **Orig after_gsub as.numeric**
1      95.31      95.31      95.31
2     992.77     992.77     992.77
3 "1,719.68"  "1719.68"         NA
4 "3,135.79"  "3135.79"         NA
5     111.91     111.91     111.91
6     305.12     305.12     305.12

Can someone help me convert them into numeric class using as.numeric()?

3
I do not get any error when I take the same steps after defining: tx=data.frame(Cost=c('100','100.1','1,999.5',NA))Naisheel Verdhan
@BondedDust: I'm not exactly sure how to apply colClasses in my situation, could you give me a hint specific to this situation? I can test and provide feedback. Thanks all!ed0reddie
@beginneR: the combined as.numeric(gsub()) function doesn't seem to fix. :(ed0reddie
@eN-V: if you load a csv with 1000 separator formatting applied to a column, you might be able to duplicate the scenario; instead of assigning a df with these values on GUI. Hopefully this helps you duplicate!ed0reddie
Why not put up these data, or a sample of it and let us try it rather than guess!Brandon Bertelsen

3 Answers

2
votes

The worked example using setClass, setAs and colClasses:

 library(methods)
  setClass("chr.w.commas", contains=numeric())
  setAs("character", "chr.w.commas", function(from) 
                              as.numeric(gsub("\\,", "",from )) )
 dat <- read.table(text="Orig after_gsub num
 1      '95.31'      '95.31'      '95.31'
 2     992.77     992.77     992.77
 3 '1,719.68'  '1719.68' NA
 4 '3,135.79'  '3135.79' NA
 5     111.91 111.91 111.91
 6     305.12     305.12     305.12", header=TRUE, colClasses="chr.w.commas")
 str(dat)
'data.frame':   6 obs. of  3 variables:
 $ Orig      : num  95.3 992.8 1719.7 3135.8 111.9 ...
 $ after_gsub: num  95.3 992.8 1719.7 3135.8 111.9 ...
 $ num       : num  95.3 992.8 NA NA 111.9 ...
1
votes

Thank you everyone who helped here. I actually found that my load function was the problem, and the following code does a simple trick to read in data correctly from the start.

read.csv(filename, sep="\t", fileEncoding="UTF-16", skip=1)    
0
votes

I suspect that gsub doesn't work right on your UTF-16 strings. Perhaps you should convert the strings before doing the substitution. Try the following:

tx <- read.table("/Users/username/Desktop/report.csv",sep="\t", dec = ".", fileEncoding = "UTF-16LE", fill = T, skip=1 , quote="", header=T, stringsAsFactors = FALSE)
tx$Cost <- iconv(tx$Cost,"UTF-16","ASCII",sub='')
tx$Cost <- gsub("\\,", replacement = "", x = tx$Cost)
tx$Cost <- as.numeric(tx$Cost)