13
votes

I would like to specify different types of missing values. My data contain several kinds of missing values, and I am trying to code them as missing in R, but I am looking for a solution where I can still distinguish between them.

Say I have some data that looks like this,

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure","Unknown","Refused","Blue", "Red", "Green"),  20, rep=TRUE), b = sample(c(1, 2, 3, 77, 88, 99),  10, rep=TRUE), f = round(rnorm(n=10, mean=.90, sd=.08), digits = 2), g = sample(c("C","M","Y","K"),  10, rep=TRUE) ); df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C

If I now make two tables my missing values ("Don't know/Not sure","Unknown","Refused" and 77, 88, 99) are included as regular data,

table(df$a,df$g)
#                     C K M Y
# Blue                0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green               2 1 2 0
# Red                 1 0 0 3
# Refused             1 1 2 0
# Unknown             0 0 3 0

and

table(df$b,df$g)
#    C K M Y
# 2  0 0 4 0
# 3  0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "Don't know/Not sure","Unknown","Refused" into <NA>

is.na(df[,c("a")]) <- df[,c("a")]=="Don't know/Not sure"|df[,c("a")]=="Unknown"|df[,c("a")]=="Refused"

and remove the empty levels

df$a <- factor(df$a) 

and the same is done with the numeric values 77, 88, and 99

is.na(df) <- df=="77"|df=="88"|df=="99"

table(df$a, df$g, useNA = "always")       
#       C K M Y <NA>
# Blue  0 0 1 2    0
# Green 2 1 2 0    0
# Red   1 0 0 3    0
# <NA>  1 1 5 1    0

table(df$b,df$g, useNA = "always")
#      C K M Y <NA>
# 2    0 0 4 0    0
# 3    0 2 0 2    0
# <NA> 4 0 4 4    0

Now the missing categories are recoded into NA, but they are all lumped together. Is there a way to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure","Unknown","Refused" and 77, 88, 99 as missing, but I still want to have the original information in the variable.

3
How about adding another column to the df called isNA, which will hold TRUE if the value is missing? Or the isNA column can directly hold NA and 0. It depends on the rest of your code. - Nishanth
That would probably work, but it's more of a workaround than a solution that would work seamlessly with the rest of my code, as you also point out. Would you care to demonstrate it in an example? - Eric Fail
It is difficult to predict the effect on the rest of the code. Maybe you can write your own my.table that uses my.is.na, which returns TRUE for "Don't know/Not sure","Unknown","Refused". - Nishanth
It looks like you've provided us with summarized data. Do you have the data in a format that is a step before this one? If so it would just be a matter of factoring. - Brandon Bertelsen
@BrandonBertelsen, thank you for your question (and your answer). The dummy data I've provided is quite close to how my real data looks. As I mentioned in my comment to @Maxim.K, I could have been a bit more precise about the variable a, but aside from that the data in the question is representative. - Eric Fail

3 Answers

19
votes

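To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does have typed NA constants: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA. These differ only in storage type, though, and do not record why a value is missing.)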
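As a quick check of the typed NA constants documented in ?NA, here is a minimal sketch (the variable names are only illustrative). It suggests that the constants differ in storage type but are all treated alike by is.na(), so on their own they cannot separate "Refused" from "Unknown":

```r
# The NA constants from ?NA, kept in a list so each one retains its own type.
typed_nas <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Each constant has a different storage type ...
na_types <- vapply(typed_nas, typeof, character(1))
print(na_types)

# ... but is.na() is TRUE for every one of them.
all_missing <- vapply(typed_nas, is.na, logical(1))
print(all_missing)
```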

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", 
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE), 
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, 
                            replace = TRUE), 
                 f = round(rnorm(n = 10, mean = .90, sd = .08), 
                           digits = 2), 
                 g = sample(c("C", "M", "Y", "K"), 10, 
                            replace = TRUE))
df2 <- df

Let's factor variable "a":

df2$a <- factor(df2$a, 
                levels = c("Blue", "Red", "Green", 
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))

Load the "memisc" library:

library(memisc)

Now, convert variables "a" and "b" to items in "memisc":

df2$a <- as.item(as.character(df2$a), 
                  labels = structure(c(1, 2, 3, 77, 88, 99),
                                     names = c("Blue", "Red", "Green", 
                                               "Don't know/Not sure",
                                               "Refused", "Unknown")),
                  missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b, 
                 labels = c(1, 2, 3, 77, 88, 99), 
                 missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

as.factor(df2$a)
#  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue 
# [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red  
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
#  [1] *Unknown             *Refused             Red                 
#  [4] Red                  Green                Green               
#  [7] Red                  Green                *Unknown            
# [10] *Refused             Blue                 Green               
# [13] Blue                 *Don't know/Not sure *Unknown            
# [16] *Refused             Blue                 Green               
# [19] *Refused             Red                 
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables behaving the way you describe, while retaining all the original information.

table(as.factor(include.missings(df2$a)), df2$g)
#                       
#                        C K M Y
#   Blue                 0 0 1 2
#   Red                  1 0 0 3
#   Green                2 1 2 0
#   *Don't know/Not sure 0 0 0 1
#   *Refused             1 1 2 0
#   *Unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#        
#         C K M Y
#   Blue  0 0 1 2
#   Red   1 0 0 3
#   Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#        
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Red   1 0 0 3    0
#   Green 2 1 2 0    0
#   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

table(as.factor(include.missings(df2$b)), df2$g)
#      
#       C K M Y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#       
#        C K M Y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

> codebook(df2$a)
========================================================================

   df2$a

------------------------------------------------------------------------

   Storage mode: character
   Measurement: nominal
   Missing values: 77, 88, 99

            Values and labels    N    Percent 

    1   'Blue'                   3   25.0 15.0
    2   'Red'                    4   33.3 20.0
    3   'Green'                  5   41.7 25.0
   77 M 'Don't know/Not sure'    1         5.0
   88 M 'Refused'                4        20.0
   99 M 'Unknown'                3        15.0

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.

5
votes

To retain the original values, you can create new columns in which you code the NA information, for example:

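# b is numeric, so compare against numeric codes
df <- transform(df, b.na = ifelse(b %in% c(77, 88, 99), NA, b))
# as.character() keeps the labels if a is a factor
# (ifelse() would otherwise return the integer level codes)
df <- transform(df, a.na = ifelse(a %in% 
                        c("Don't know/Not sure", "Unknown", "Refused"),
                        NA, as.character(a)))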

Then you can do something like this:

   table(df$b.na , df$g)
    C K M Y
  2 0 0 4 0
  3 0 2 0 2
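Because the original columns are left untouched, the reason behind each missing value can still be recovered. A minimal sketch on a small made-up data frame (`toy`, `is_miss`, and `reason_table` are illustrative names, not part of the question's data):

```r
# Hypothetical toy data: 77/88/99 are the special "missing" codes.
toy <- data.frame(b = c(2, 77, 3, 99, 88, 99),
                  g = c("M", "Y", "K", "C", "C", "M"),
                  stringsAsFactors = FALSE)

# New column with the special codes turned into NA; `b` itself is kept.
toy$b.na <- ifelse(toy$b %in% c(77, 88, 99), NA, toy$b)

# Analyses that should ignore missing values use b.na ...
valid_mean <- mean(toy$b.na, na.rm = TRUE)

# ... while the original column still says why each value is missing.
is_miss <- is.na(toy$b.na)
reason_table <- table(toy$b[is_miss], toy$g[is_miss])
print(reason_table)
```

The cost of this design is one extra column per variable, which has to be kept in sync if the original column is ever edited.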

Another option, which avoids creating new columns, is the exclude argument of table(). It drops the unwanted values from the table without converting them to missing values:

table(df$a,df$g,
      exclude=c('77','88','99',"Don't know/Not sure","Unknown","Refused")) 
       C K M Y
  Blue  0 0 1 2
  Green 2 1 2 0
  Red   1 0 0 3

You can define some global constants (even though this is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:

B_MISSING <- c('77','88','99')
A_MISSING <- c("Don't know/Not sure","Unknown","Refused")
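
One way such constants might be used, as a sketch only: a small helper (the name `is_coded_missing` is hypothetical) that flags the special codes, so that the same definition of "missing" is reused everywhere while the original values stay in place.

```r
# Constants as defined above, repeated so the example runs on its own.
B_MISSING <- c('77', '88', '99')
A_MISSING <- c("Don't know/Not sure", "Unknown", "Refused")

# Hypothetical helper: TRUE where a value is one of the special codes.
# as.character() lets it accept numeric, character, or factor input.
is_coded_missing <- function(x, codes) as.character(x) %in% codes

# Made-up example vectors.
a <- c("Blue", "Refused", "Red", "Unknown")
b <- c(2, 77, 3, 99)

flag_a <- is_coded_missing(a, A_MISSING)
flag_b <- is_coded_missing(b, B_MISSING)

# Count valid answers without overwriting the original values.
n_valid_a <- sum(!flag_a)
n_valid_b <- sum(!flag_b)
```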

5
votes

If you are willing to stick to numeric values then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values:

x <- c(NA, Inf, -Inf, NaN, 1)
is.finite(x)
## [1] FALSE FALSE FALSE FALSE  TRUE

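is.infinite, is.nan and is.na are also useful here. Note that is.na is TRUE for NaN as well, so test with is.nan first if the two need to be told apart.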

We could have a special print function that displays them in a more meaningful way, or even create a special class, but even without that the above would divide the data into finite values and several kinds of non-finite values.
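
A rough sketch of that idea, under assumptions that are not in the question: the mapping of 77/88/99 to Inf/-Inf/NaN is arbitrary, and `describe_missing` is a hypothetical helper standing in for the "more meaningful" display.

```r
# Hypothetical survey codes: 77 = don't know, 88 = refused, 99 = unknown.
raw <- c(2, 77, 3, 88, 99, NA)

# Map each special code to a different non-finite value.
# %in% is used instead of == so that the NA in `raw` never enters the index.
coded <- raw
coded[raw %in% 77] <- Inf    # don't know
coded[raw %in% 88] <- -Inf   # refused
coded[raw %in% 99] <- NaN    # unknown

# Hypothetical helper that turns the non-finite values back into labels.
describe_missing <- function(v) {
  reason <- rep("valid", length(v))
  reason[is.nan(v)] <- "unknown"
  reason[is.na(v) & !is.nan(v)] <- "plain NA"   # is.na() is TRUE for NaN too
  reason[v %in% Inf] <- "don't know"
  reason[v %in% -Inf] <- "refused"
  reason
}

reasons <- describe_missing(coded)
print(reasons)

# Only finite values take part in the computation.
valid_mean <- mean(coded[is.finite(coded)])
```

One caveat with this encoding: na.rm = TRUE removes NA and NaN but not Inf or -Inf, so summaries need the is.finite() filter shown above.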