1
votes

I am trying to merge two data frames in R according to ID-variable , since in the first data frame the ID-variable is a factor and in the second data frame the ID-variable is a numeric vector

datC <- data.frame("ID" = c("001","010","100","110"), "X" = rnorm(4))
datD <- data.frame("ID" = c(001,010,100,110,210), "Y" = rpois(5,3))
merge(datC,datD,by="ID")

ID           X  Y
100  0.03284493 2
110 -0.41690986 3

As we see the 001,010 are missing , and it is clear for me why they are missing and how I can solve this

datC$ID <- as.numeric(as.character(datC$ID))
merge(datC,datD,by="ID")

I have tried to find another solution by let R avoid TO treat a vector of character as a factor

datC <- data.frame("ID" = c("001","010","100","110"), "X" = rnorm(4),stringsAsFactors = F)
datD <- data.frame("ID" = c(001,010,100,110,210), "Y" = rpois(5,3))

When I merged these two data frames according to the ID-variable , I expected that I will get null results , since we have a character vs numeric vectors but in fact R gave me the same results as the ID-Variable in the first data frame is a factor .

merge(datC,datD,by="ID")

ID          X  Y
100 -0.2797956 4
110 -1.0397791 4

So can someone explain me why my expectation was false !

1

1 Answers

1
votes

When you merge two columns, they are coerced to a common class:

Factors, raw vectors and lists are converted to character vectors, and then x and table are coerced to a common type (the later of the two types in R's ordering, logical < integer < numeric < complex < character) before matching.

(from the ?match help page, linked from ?merge)

So your numeric column becomes character, as does your factor:

as.character(c(001,010,100,110,210))
# [1] "1"   "10"  "100" "110" "210"

This explains the result seen in the OP. This sort of coercion is common throughout R.


Comment. For ID columns, I always use character; it's a lot less hassle than the alternatives. sprintf is handy for adding leading zeros, ensuring all IDs are the same character-length:

sprintf("%03d", c(001,010,100,110,210))
# [1] "001" "010" "100" "110" "210"