How to read data when some numbers contain commas as thousand separator?

135

votes

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

r csvr-faq

152

votes

Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:

y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1]  1200 20000 100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with sed in unix.

67

votes

You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:

setClass("num.with.commas")
setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:

DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))

19

votes

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})

15

votes

This question is several years old, but I stumbled upon it, which means maybe others will.

The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.

library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
          col_types = list(col_numeric())
        )

This yields

Source: local data frame [4 x 1]

  numbers
    (dbl)
1   800.0
2  1800.0
3  3500.0
4     6.5

An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)

For instance, if I had not flagged the col_types, I would have gotten this:

> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]

  numbers
    (chr)
1     800
2   1,800
3    3500
4     6.5

(Notice that it is now a chr (character) instead of a numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)

\"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll find trouble reading that comma at all!

> tail(read_csv(tmp))
Source: local data frame [6 x 1]

     3"
  (dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.

8

votes

a `dplyr` solution using `mutate_all` and pipes

say you have the following:

> dft
Source: local data frame [11 x 5]

   Bureau.Name Account.Code   X2014   X2015   X2016
1       Senate          110 158,000 211,000 186,000
2       Senate          115       0       0       0
3       Senate          123  15,000  71,000  21,000
4       Senate          126   6,000  14,000   8,000
5       Senate          127 110,000 234,000 134,000
6       Senate          128 120,000 159,000 134,000
7       Senate          129       0       0       0
8       Senate          130 368,000 465,000 441,000
9       Senate          132       0       0       0
10      Senate          140       0       0       0
11      Senate          140       0       0       0

and want to remove commas from the year variables X2014-X2016, and convert them to numeric. also, let's say X2014-X2016 are read in as factors (default)

dft %>%
    mutate_all(funs(as.character(.)), X2014:X2016) %>%
    mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
    mutate_all(funs(as.numeric(.)), X2014:X2016)

mutate_all applies the function(s) inside funs to the specified columns

I did it sequentially, one function at a time (if you use multiple functions inside funs then you create additional, unnecessary columns)

6

votes

"Preprocess" in R:

lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

Can use readLines on a textConnection. Then remove only the commas that are between digits:

gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)

## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"

It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).

Edit: Later I discovered how to use colClasses by designing a new class. See:

How to load df with 1000 separator in R as numeric class?

6

votes

We can also use readr::parse_number, the columns must be characters though. If we want to apply it for multiple columns we can loop through columns using lapply

df[2:3] <- lapply(df[2:3], readr::parse_number)
df

#  a        b        c
#1 a    12234       12
#2 b      123  1234123
#3 c     1234     1234
#4 d 13456234    15342
#5 e    12312 12334512

Or use mutate_at from dplyr to apply it to specific variables.

library(dplyr)
df %>% mutate_at(2:3, readr::parse_number)
#Or
df %>% mutate_at(vars(b:c), readr::parse_number)

data

df <- data.frame(a = letters[1:5], 
                 b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
                 c = c("12", "1,234,123","1234", "15,342", "123,345,12"), 
                 stringsAsFactors = FALSE)

4

votes

If number is separated by "." and decimals by "," (1.200.000,00) in calling gsub you must set fixed=TRUE as.numeric(gsub(".","",y,fixed=TRUE))

4

votes

Using read_delim function, which is part of readr library, you can specify additional parameter:

locale = locale(decimal_mark = ",")

read_delim("filetoread.csv", ';", locale = locale(decimal_mark = ","))

*Semicolon in second line means that read_delim will read csv semicolon separated values.

This will help to read all numbers with a comma as proper numbers.

Regards

Mateusz Kania

3

votes

A very convenient way is readr::read_delim-family. Taking the example from here: Importing csv with multiple separators into R you can do it as follows:

txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'

require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")

Which results in the expected result:

# A tibble: 3 × 6
  OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
     <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
1        1   Bagamoyo         1 136227  8514187500 352678.8
2        2    Bariadi         2  88350  5521875000 526307.3
3        3     Chunya         3 483059 30191187500 352444.7

0

votes

I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.

For example, if your file were like this:

"1,234","123","1,234"
"234","123","1,234"
123,456,789

Then, you could use the regular expression "([0-9]+),([0-9]+)" and replace it with \1\2

1234,"123",1234
"234","123",1234
123,456,789

Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.

How to read data when some numbers contain commas as thousand separator?

11 Answers

a dplyr solution using mutate_all and pipes

a `dplyr` solution using `mutate_all` and pipes