3
votes

I am forming a data.frame from character data that is not under my control (from an API). I would like the resulting variables to get their most natural class with minimal fuss. Specifically, I want integer variables, not numeric, when appropriate.

I am digging this data out of XML and one attribute -- let's call it attA -- presents integers as integers, i.e. with no period and trailing zero. Another attribute -- let's call it attB -- is more generally useful and correct, but always presents numbers with one decimal place, even if that is uniformly zero. (The data could also be character, mind you!)

My initial approach was based on attA and processing through type.convert() but now I want to use attB. From reading the type.convert() docs, I'm surprised it does not produce integers when all the data could be represented as integer. Am I misreading that? Any suggestions on how to get what I want without doing some unholy processing of the character data?

attA <- c("1", "2")
str(type.convert(attA))
#>  int [1:2] 1 2

attB <- c("1.0", "2.0")
str(type.convert(attB))
#>  num [1:2] 1 2

unholy <- gsub("\\.0$", "", attB)
str(type.convert(unholy))
#>  int [1:2] 1 2

Relevant bit of type.convert() docs: "Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen... Vectors containing optional whitespace followed by decimal constants representable as R integers or values from na.strings are converted to integer."

2
Any reason why you can't replace type.convert() with as.integer()? as.integer(attB) works well. Also read.table() could possibly be used, and you can specify colClasses there. - Rich Scriven
In general, I do not know whether the data will be integer only, numeric, or even character. I really want that hierarchy of logical, integer, numeric, character to be applied literally (I always use type.convert(..., as.is = TRUE)). That's why I can't use as.integer(). - jennybryan

2 Answers

2
votes

From reading the type.convert() docs, I'm surprised it does not produce integers when all the data could be represented as integer. Am I misreading that?

I think you may be.

In some contexts, converting a number written as 123.0 to 123 does change its meaning: the trailing zero in 123.0 can be intended to indicate that it represents a value measured to a higher degree of precision (e.g. to the nearest tenth) than 123 (which may only have been measured to the nearest integral value). (See Wikipedia's article on significant figures for a fuller explanation.) So type.convert() takes the appropriate/conservative approach of treating 123.0 (and indeed 123.) as representing numeric rather than integer values.
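
A quick way to see that behaviour for yourself, as a minimal sketch. Passing as.is = TRUE explicitly is my own choice here (newer R versions may warn when it is left out); it does not affect the integer/numeric decision.

```r
# Strings containing a decimal point are classed as numeric, even when the
# fractional part is zero or missing; only bare digits come back as integer.
inputs <- c("123", "123.0", "123.")
classes <- vapply(
    inputs,
    function(s) class(type.convert(s, as.is = TRUE)),
    character(1)
)
print(classes)
```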

As a solution, how about something like this?

type.convert2 <- function(x) {
    x <- sub("(^\\d+)\\.0*$", "\\1", x)
    type.convert(x)
}

class(type.convert2("123.1"))
# [1] "numeric"
class(type.convert2("123.0"))
# [1] "integer"
class(type.convert2("123."))
# [1] "integer"

class(type.convert2("hello.0"))
# [1] "factor"
type.convert2("hello.0")
# [1] hello.0
# Levels: hello.0
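
Since sub() is vectorised, the same helper should also behave sensibly on whole columns. The sketch below repeats the function so it runs on its own; the as.is argument is my addition, not part of the answer above. One limitation to be aware of: the pattern anchors on a leading digit, so a signed value such as "-1.0" is not stripped and stays numeric.

```r
# Copy of the helper above, with as.is exposed (an assumption of this sketch).
type.convert2 <- function(x, as.is = TRUE) {
    x <- sub("(^\\d+)\\.0*$", "\\1", x)
    type.convert(x, as.is = as.is)
}

str(type.convert2(c("1.0", "2.0")))      # every value is whole
str(type.convert2(c("1.0", "2.5")))      # one true decimal keeps the vector numeric
str(type.convert2(c("1.0", NA, "3.")))   # missing values pass through
str(type.convert2(c("a.0", "b")))        # non-numeric data is left as character
str(type.convert2("-1.0"))               # sign is not matched by the pattern
```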
1
votes

One way would be to convert first, then compare the result against the same values coerced to integer:

res <- type.convert(attB)
if (isTRUE(all.equal((tmp <- as.integer(res)), res))) res <- tmp

Another possibility could be using trunc to test against truncated values.
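
A minimal sketch of the trunc() idea. The helper name demote_whole is hypothetical, and the range check against .Machine$integer.max is my own addition so that large whole numbers are not turned into NA. Unlike all.equal(), which compares within a tolerance, this uses exact equality, so a value like 1.000000001 stays numeric.

```r
# Hypothetical helper: demote a double vector to integer only when every
# non-missing value is whole and fits in R's integer range.
demote_whole <- function(x) {
    if (!is.double(x)) return(x)   # leave logical, integer, character, factor alone
    ok <- x[!is.na(x)]
    whole <- all(ok == trunc(ok))
    in_range <- all(abs(ok) <= .Machine$integer.max)
    if (whole && in_range) as.integer(x) else x
}

str(demote_whole(type.convert(c("1.0", "2.0"), as.is = TRUE)))
str(demote_whole(type.convert(c("1.0", "2.5"), as.is = TRUE)))
str(demote_whole(type.convert(c("a", "b"), as.is = TRUE)))
```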

type.convert() won't convert these strings to integers because it uses the C function strtol, which stops parsing at the ".". In the R source you then see this line, where res is the value strtol parsed from the string and endp points at the first character it did not consume:

if (*endp != '\0') res = NA_INTEGER;

In other words, if strtol did not consume the entire string, the value is not treated as an integer.
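
To make the "whole string must be consumed" idea concrete, here is a rough illustration in R. It only mimics the rule with a regular expression for a few simple inputs; it is not how R implements the check, and the function name looks_like_integer is made up for this sketch.

```r
# Illustration only: a string counts as an integer when an optional sign
# followed by digits accounts for the entire string, with nothing left over.
looks_like_integer <- function(s) grepl("^[-+]?[0-9]+$", s)

inputs <- c("12", "-3", "1.0", "1.", "x")
guess <- looks_like_integer(inputs)

# Compare with what type.convert() actually decides for each string.
actual <- vapply(
    inputs,
    function(s) is.integer(type.convert(s, as.is = TRUE)),
    logical(1)
)
print(data.frame(input = inputs, guess = guess, actual = unname(actual)))
```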