This question is several years old, but I stumbled upon it, which means others may too.
The readr package has some nice features. One of them is a convenient way to interpret "messy" columns like these:
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields:
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. If you try to fix things after the fact, you often end up relying on dangerous assumptions that are hard to spot. (Which is why flat files are so evil in the first place.)
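As a minimal sketch of the "process while reading" route: current readr releases provide col_number() and parse_number(), which drop grouping marks such as the comma while parsing. The column name numbers and the sample values are taken from the example above. Whether your installed version still accepts col_numeric() is something to check yourself.

```r
library(readr)

# Same literal data as above: one column, with a quoted "1,800"
raw <- "numbers\n800\n\"1,800\"\n\"3500\"\n6.5"

# Declare the column type up front so nothing is guessed;
# col_number() ignores grouping marks like "," while parsing
parsed <- read_csv(raw, col_types = cols(numbers = col_number()))
parsed$numbers

# The same parser is available for a character vector already in memory
cleaned <- parse_number(c("800", "1,800", "3500", "6.5"))
cleaned
```

The first call handles the conversion at read time. parse_number() is the fallback if the data is already loaded as character.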
For instance, if I had not specified col_types, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that the column is now chr (character) instead of numeric.)
Or, more dangerously, if the column were long enough and the early elements did not contain commas, so that the column type gets guessed from those early rows:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then the comma gets misread entirely: the column is guessed as a double, and 1,003 comes back as 1.003 with only a generic parsing warning!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
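One way to sidestep the guessing is to state the column type explicitly, so the late "1,003" cannot be misread. The following is only a sketch under my own assumptions: the header numbers and the helper names values and csv_text are made up for illustration, and each value is fully quoted (unlike the tmp string above).

```r
library(readr)

# 100 small integers followed by one value with a thousands separator
values <- c(as.character(rep(1:10, 10)), "1,003")

# Build literal CSV text: a header line, then each value in double quotes
csv_text <- paste0("numbers\n", paste0("\"", values, "\"", collapse = "\n"), "\n")

# Declaring the type means the first rows no longer decide how the column is parsed
df <- read_csv(csv_text, col_types = cols(numbers = col_number()))

tail(df$numbers, 1)  # the last value, read as a number
problems(df)         # lists any rows that failed to parse
```

Calling problems() on the result is a cheap habit after any read, since it turns the one-line warning into a table of the offending rows.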