4
votes

I have the following data frame in R with 274569 rows and 15 columns:

> str(x2)
'data.frame':   274569 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

I create a subset and display its structure:

> msub <- x2[x2$ykod == 99,]
> str(msub)
'data.frame':   14367 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

Now I have a subset with 14367 rows and 15 columns, but the factors still carry all of their original levels. I expected the number of levels to decrease. For example, yad should have only one level.

How can I easily make str() show the correct factor levels, so that str(msub) reports only the levels actually present?

4
As the answers demonstrate, str is in fact giving you the correct information. However, working with factors in data.frames is often confusing. Consider using stringsAsFactors=FALSE in your original call to read.table or read.csv to import the data. - Andrie

4 Answers

13
votes

This is expected behavior. Factor levels that have no representation in your subset do not "disappear" until you tell them to. Since R 2.12.0, you can use droplevels(), which works on a single factor or on a whole data frame.
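A minimal sketch of droplevels() on a subset, using a small made-up data frame (df and its values are hypothetical stand-ins for x2, not the asker's data):

```r
# Hypothetical toy data frame standing in for x2
df <- data.frame(
  ykod = c(99, 99, 12),
  yad  = factor(c("BARBIE", "BARBIE", "BAKUGAN"))
)

# Subsetting keeps every original level, even unused ones
sub <- df[df$ykod == 99, ]
n_before <- nlevels(sub$yad)   # 2: "BAKUGAN" is still a level

# droplevels() removes unused levels from all factor columns
sub <- droplevels(sub)
n_after <- nlevels(sub$yad)    # 1: only "BARBIE" remains
```

After this, str(sub) should report yad as a factor with a single level.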

5
votes

In fact, str is showing you the correct structural information: the levels describe which values the factor can take, not which values it currently holds. Imagine concatenating two of your subsets where one contained some of the levels and the other a different set: merging them would be a hassle if each had dropped its unused levels. This is simply how factors work in R.

If you want to know which levels are actually present in your data, one option is to use table to count the occurrences of each level.
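As a rough illustration of the table approach, assuming a small made-up factor f (not taken from the question's data), levels with a count of zero are the unused ones:

```r
# Hypothetical factor with one unused level ("c")
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# table() counts every level, including those that never occur
counts <- table(f)

# Keep only the names of levels that occur at least once
present <- names(counts)[as.vector(counts) > 0]
unused  <- names(counts)[as.vector(counts) == 0]
```

Here present holds the levels that occur in f and unused holds the rest.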

If you want your factor reduced, so it only contains the levels that are actually present, you can reapply factor to it:

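# Factor with three declared levels, of which only "a" and "b" occur
myfact <- factor(rep(1:2, 5), levels = 1:3, labels = letters[1:3])
myfact
# [1] a b a b a b a b a b
# Levels: a b c

# Re-applying factor() rebuilds the levels from the values present
factor(myfact)
# [1] a b a b a b a b a b
# Levels: a b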

You can simply apply this to all the factor columns of your data.frame to get what you say you want.
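One possible way to do that for every factor column, sketched on a made-up data frame (df, g, and n are hypothetical names):

```r
# Hypothetical data frame: g has an unused level "b", n is not a factor
df <- data.frame(
  g = factor(c("a", "a"), levels = c("a", "b")),
  n = 1:2
)

# Find the factor columns, then re-apply factor() to each of them
is_fact <- vapply(df, is.factor, logical(1))
df[is_fact] <- lapply(df[is_fact], factor)

remaining <- levels(df$g)   # unused level "b" is gone
```

Non-factor columns such as n are left untouched because only the columns flagged in is_fact are reassigned.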

1
votes

The levels of a factor are part of the column itself and do not depend on which values are actually present:

> x <- factor(LETTERS[1:10])
> x
 [1] A B C D E F G H I J
Levels: A B C D E F G H I J
> y <- x[1]
> y
[1] A
Levels: A B C D E F G H I J
> factor(y)
[1] A
Levels: A

I am sure there are other ways, but calling factor() on the subset works.

1
votes
# Subsetting a factor with drop=TRUE discards unused levels in one step
x <- factor(LETTERS[1:10])
y <- x[1, drop=TRUE]
y