4
votes

I have a SAS dataset that I am converting to sql server 2005. In the sas dataset there are numeric variables that allow for multiple missing values.

For example column AGE is specified as

0-124 = number
. = missing
.A = Invalid
.B = Unavailable

Now I want to use a number for the sql server datatype and need to maintain the integrity of the data. My first thought is to create a reference table for the actual value from sas:

-1 = .A = Invalid
-2 = .B = Unavailable

This seems messy since age would require entry for each age 1-124. Curious if anyone else has run across this and has a better solution.

2

2 Answers

2
votes

I imagine that most of the time you're not going to care about why values are missing, just that they are missing - but by encoding it as negative integers, you'll have to be sure to filter those out every time you do anything with that variable. That's backwards: you should do something special to incorporate the kind-of-missingness information, not to run basic math and statistics.

Rather than trying to recreate in SQL Server a data type that it doesn't have, why not work with what it's got? I would probably just send the missing values as NULL to SQL Server and then encode the type of missingness in a second variable. When you actually need to know why data are missing, it's still there; but when you don't, SQL Server will handle the missing values in your variable using its default methods for dealing with them.

2
votes

You could use a format to recode the missings to negatives while leaving the non-missings unchanged.

proc format;
  value chgmiss
    .=.
    .a=-1
    .b=-2
    .c=-3
;
run;

data a;
input original;
new=put(original, chgmiss3.);
put original=  new=;
datalines;
.
9
102
3
2
15
90
.a
.b
.
78
;
run;