0
votes

I'm stuck on an apparently very simple problem with factor variables that hold character values.

test = data.frame(uv=c("03834","06044","06054","03834","48557","48207","03834","06044","48557"))
test
uv=c()
for (i in 1:length(test$uv)){
  uv[i]=test[i,"uv"]
}
uv

And this is what I get:

> test = data.frame(uv=c("03834","06044","06054","03834","48557","48207","03834","06044","48557"))
> test
     uv
1 03834
2 06044
3 06054
4 03834
5 48557
6 48207
7 03834
8 06044
9 48557
> uv=c()
> for (i in 1:length(test$uv)){
+   uv[i]=test[i,"uv"]
+ }
> uv
[1] 1 2 3 1 5 4 1 2 5
> 

My question is: why does it keep the level numbers instead of the character values?

I know that if I put:

     uv[i]=as.character(test[i,"uv"])

that works, but in "real life", my variables can be numeric, so I don't want to force it to character...

It seems that something is missing in my understanding of factors!

Thanks.

1
I think the most urgent original reason was to save space, but there are other reasons to do it that way as well, namely the fact that character strings are often actually representing some kind of enumerated type. But I think this really needs to be answered by an R-oldtimer who was around when those decisions were being made. - Mike Wise

1 Answer

3
votes

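A factor stores two things: an integer code for each element, and the distinct values themselves, which are kept once in "levels". If you copy a factor element by element into a plain vector (as your for loop does), only the integer code, i.e. the position of the value within the levels, is kept. You can think of the levels as a lookup table indexed by position.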

If you do this:

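# stringsAsFactors = TRUE reproduces the factor column on R >= 4.0.0,
# where data.frame() no longer converts strings to factors by default
test = data.frame(uv=c("03834","06044","06054","03834","48557","48207","03834","06044","48557"),
                  stringsAsFactors = TRUE)
test
uv = c()
for (i in seq_along(test$uv)){
  uv[i] = test[i,"uv"]  # only the integer code survives this assignment
}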

uv

# [1] 1 2 3 1 5 4 1 2 5

factor(uv, labels = levels(test$uv))

# [1] 03834 06044 06054 03834 48557 48207 03834 06044 48557
# Levels: 03834 06044 06054 48207 48557

You'll see that you can use the positions and the levels from your original dataset to recover the actual values. A factor keeps each distinct value once, in its levels, and represents every element as an integer code pointing into them, because working with integers is faster than working with character values. No information is lost, since there is a 1-to-1 relationship between each character value and its integer code.
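To make the lookup-table idea concrete, here is a minimal sketch that pulls a factor apart into its integer codes and its levels and then puts the character values back together. The variable names (`f`, `codes`, `lv`, `restored`) are only illustrative.

```r
# Build a small factor directly, so the result does not depend on the R version
f <- factor(c("03834", "06044", "06054", "03834", "48557"))

codes <- as.integer(f)  # the integer codes stored inside the factor
lv    <- levels(f)      # the distinct values, sorted: the lookup table

# Index the lookup table by the codes to recover the character values
restored <- lv[codes]

print(codes)
print(lv)
print(restored)
```

Indexing `levels(f)` by the codes is the same lookup that `as.character(f)` performs.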

If you do:

uv2 = test[,"uv"]
uv2

# [1] 03834 06044 06054 03834 48557 48207 03834 06044 48557
# Levels: 03834 06044 06054 48207 48557

You'll see that uv2 keeps all the information, because you took the factor column as a whole instead of copying it element by element into a plain vector.
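As a small sketch of where the information gets lost: a single element extracted from a factor is still a factor, and it is the assignment into a plain (non-factor) vector that drops the levels. The names `f`, `one` and `plain` are illustrative only.

```r
f <- factor(c("03834", "06044", "03834"))

one <- f[1]           # extracting one element keeps the factor class
print(is.factor(one)) # TRUE
print(levels(one))    # all levels are still attached

plain <- c()
plain[1] <- f[1]      # assigning into a plain vector keeps only the code
print(is.factor(plain))
print(plain)
```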

I'm not sure what you mean by numeric variables in "real life". In that case you won't have any problem, since a numeric variable is neither a factor nor a character variable.

test = data.frame(uv=c(03834,06044,06054,03834))
test
uv= c()
for (i in 1:length(test$uv)){
  uv[i]=test[i,"uv"]
}

uv

# [1] 3834 6044 6054 3834

But you will lose any leading zeros.
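If the codes were read as numbers and the leading zeros are needed back, one option is to pad them when converting to character. This is only a sketch, and it assumes that every code is a whole number that should be exactly 5 characters wide.

```r
x <- c(3834, 6044, 48557)  # numeric values whose leading zeros were dropped

# Assumption: fixed width of 5 digits, padded on the left with zeros
padded <- sprintf("%05d", as.integer(x))
print(padded)
```

The result is a character vector, so it is best done as a last step, after any numeric work.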

If you prefer to work with numeric or character variables only, you can use the option stringsAsFactors = FALSE (the default since R 4.0.0), which ensures that character columns are not converted to factors.

test = data.frame(uv=c("03834","06044","06054","03834","48557","48207","03834","06044","48557"),
                  stringsAsFactors = FALSE)
test
uv= c()
for (i in 1:length(test$uv)){
  uv[i]=test[i,"uv"]
}

uv

# [1] "03834" "06044" "06054" "03834" "48557" "48207" "03834" "06044" "48557"

In that case your loop will treat numeric variables as numeric and character variables as characters without any problem.
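If the data frame already exists and may mix factor and numeric columns, another possible approach is to convert only the factor columns and leave everything else untouched. The sketch below uses a hypothetical helper, `plain_values`, which is not part of base R, and made-up example data.

```r
# Hypothetical helper: character values for factors, other types unchanged
plain_values <- function(x) {
  if (is.factor(x)) as.character(x) else x
}

# Example data with one factor column and one numeric column
df <- data.frame(
  id = factor(c("03834", "06044", "03834")),
  n  = c(1.5, 2, 3)
)

# Apply the helper to every column while keeping the data frame structure
df[] <- lapply(df, plain_values)
str(df)
```

With this, a loop such as the one above never meets a factor, and numeric columns are not forced to character.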