
First off, I'm a first-time poster, so please bear with me. I've searched for answers both here and elsewhere, but have yet to find what I'm looking for. I'm quite new to SAS (and programming) and so it is highly possible I've searched for the wrong things.

Anyhow. I work in research, currently as a data manager for a large longitudinal questionnaire study on work and health that has followed the same participants over five waves of data collection. We want to encourage wider use of our dataset, so we'd like to create a teaching dataset from our current data. The teaching dataset currently includes 2000 randomly selected individuals and 463 variables; this is only a subset of the scales and some of the background info from the master set.

My problem is that one criterion must be met before we can distribute the set: every person has to be and stay anonymous, so we must introduce random errors into the dataset. I have already grouped many of the background variables (income, age, education, etc.), but I want every variable to include at least some random error. I can't figure out how to do this. Most variables look like this:

Health_1 Health_n
       1        2
       4        2
       5        5
       .        1
       1        1

Most variables can take values between 1 and 5 (and missing). I've been thinking about replacing values (i.e., every 1=2, every 2=3, etc.), but that would distort the results of many analyses. Instead, for every variable, I would like to randomly change, for example, 50 of the 2000 observations to any integer the variable can assume (1 to 5 or missing).

Any suggestions? I guess I could change every n'th observation of variable y to x, but that won't be random. And I would like to change all variables at once instead of writing code for every single variable.

The approach you've specified may result in some rows having no errors induced, allowing exact matching for some individuals. Even if you guarantee that every row has at least one induced error, fuzzy matching may be possible given multiple rows for the same individual. To be on the safe side, I would recommend replacing all descriptive variable names with generic ones. – user667489
What is your purpose in the randomization? Do you still want to maintain variance/covariance of variables or is it meant to provide a test data set for research who can then access the full data set at some point? Would you expect the results from the TEST data to align with results from the original data set? – Reeza
Thank you for your questions and sorry for my late reply, I've been on a short trip. user667489, that's a great idea and something I've also been thinking about - thanks! Reeza, no I would like to maintain the variance of variables. It is only meant to be used for students to practice on. – viktorp

2 Answers

1
votes

I would use a data step and randomly pick observations to change.

data want;
set have;
/*Random uniform - change seed as you see fit*/
_rand= ranuni(1); 

/*Select approximately 50/2000 = 2.5% of records: a uniform draw falls below 0.025 about 2.5% of the time*/
if _rand < 50/2000 then do;
   /*Set variable to integer 0-5*/
   var1 = floor(6*ranuni(1));
   /*if set to 0, then set missing*/
   if var1 = 0 then 
      var1=.;

   /*Do this however many times you need*/
end;
/*do not put the _rand value into the output data*/
drop _rand;
run;
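
To cover every variable without repeating the block by hand, one option is an array loop. The following is only a sketch under some assumptions: it assumes every numeric variable in `have` is a 1-5 item (a numeric ID column would have to be left out, for example by listing the item variables explicitly instead of using `_numeric_`), and the names `items` and `_i` are mine. Note that it draws a fresh random number for each cell rather than once per row, so each variable gets its own set of roughly 2.5% changed observations.

```sas
data want;
   set have;
   /* Assumption: all numeric variables read from HAVE are 1-5 items */
   array items {*} _numeric_;
   do _i = 1 to dim(items);
      /* Independent draw per cell: about 50 of 2000 rows per variable */
      if ranuni(1) < 50/2000 then do;
         /* Integer 0-5, where 0 stands for missing */
         items{_i} = floor(6*ranuni(1));
         if items{_i} = 0 then items{_i} = .;
      end;
   end;
   /* Loop counter is created after the array, so it is not part of it */
   drop _i;
run;
```

Because the selection is a probability per cell, the number of changed values per variable will vary around 50 rather than being exactly 50.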
0
votes

Well, I don't know SAS, but I'll suggest a general principle that should work generically:

  1. read the data for a field
  2. sample a random variable (usually a random number function returns a number between 0 and 1)
  3. if the sample is below a precalculated number, do the shift of the number, otherwise continue to next number.

When shifting, sample the random number again, multiply it by 6, and round down. That gives an integer from 0 to 5; if it is 0, the field should be empty.

The precalculated number in this case is 50/2000 = 0.025.
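
As a rough check of the arithmetic, assuming a selection threshold of 50/2000 and a replacement drawn uniformly from the six possible states (1 to 5 or missing), about 50 values per variable get redrawn, but some of those redraws will land on the value that was already there:

```latex
\begin{align*}
p_{\text{select}} &= \frac{50}{2000} = 0.025 \\
E[\text{values redrawn}] &= 2000 \times 0.025 = 50 \\
P(\text{new} = \text{old}) &= \frac{1}{6} \\
E[\text{values actually changed}] &= 50 \times \frac{5}{6} \approx 41.7
\end{align*}
```

So if you really want around 50 altered values per variable, the threshold would need to be a bit higher, or the redraw would have to exclude the current value.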