I'm having trouble using proc surveyselect to draw a random sample from a population. Here is the scenario: I have a sample pool of, say, 1000 observations with the variables ID, gender, and income. My goal is to randomly select 400 observations to form group 1, with the rest going to group 2. However, the mean income in group 1 and in group 2 should match the mean income in the sample pool, and the proportions of males and females in groups 1 and 2 should also match those in the pool. Is there any way to do this with proc surveyselect (SAS)? Can anyone share example syntax?

2 Answers

1
votes

You can control for gender by using a strata statement to tell proc surveyselect to sample each gender separately, then combine the separate samples for each gender. I think it should then be possible to use proc stdize to rescale the sample mean incomes based on the output from proc surveyselect and your original dataset. I don't have time to provide full details just now as this is quite a complex proc, but I think that's your best line of inquiry at this point.
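
A minimal sketch of the stratify-by-gender step described above, assuming a hypothetical dataset named pool with the variables id, gender, and income from the question. The simulated values, the dataset names, and the seed are illustrative only. It stratifies on gender alone and then compares mean income across the two groups, without attempting any rescaling:

```sas
/* Hypothetical pool: 1000 rows with id, gender, income (simulated for illustration) */
data pool;
  call streaminit(2024);
  do id = 1 to 1000;
    if rand('Uniform') > 0.6 then gender = 'M';
    else gender = 'F';
    income = round(exp(rand('Normal', 10.5, 0.5)));
    output;
  end;
run;

/* STRATA requires the input to be sorted by the strata variable */
proc sort data=pool;
  by gender;
run;

/* Draw 40% within each gender; OUTALL keeps unselected rows with Selected=0 */
proc surveyselect data=pool out=pool_split method=srs samprate=0.4 seed=2024 outall;
  strata gender;
run;

/* Compare mean income for selected (group 1) vs. unselected (group 2) rows */
proc means data=pool_split n mean;
  class selected;
  var income;
run;
```

Stratifying on gender alone fixes the gender proportions but leaves mean income to chance, so the group means should land near the pool mean without matching it exactly.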

0
votes

Really you are just talking about using strata here, if your income is (or can be treated as) a discrete variable. An example:

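/* Simulate a population of 1000 with sex and a discrete income level (1-6) */
data population;
  call streaminit(7);
  do _n_ = 1 to 1000;
    if rand('Uniform') > 0.6 then sex='M';
    else sex='F';
    income = ceil(6*rand('Uniform'));
    output;
  end;
run;

/* Distribution of sex and income in the full population */
proc freq data=population;
  tables sex income;
run;

/* The strata statement requires input sorted by the strata variables */
proc sort data=population;
  by sex income;
run;

/* Sample 40% within each sex-by-income stratum; outall keeps every row
   and flags the sampled ones with Selected=1 */
proc surveyselect data=population out=sample samprate=0.4 outall;
  strata sex income;
run;

proc sort data=sample;
  by selected;
run;

/* Compare the distributions for selected and unselected rows */
proc freq data=sample;
  by selected;
  tables sex income;
run;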

That gives you a 40% sample from each sex-by-income stratum separately (so 40% of 'Males income=1', 40% of 'Females income=3', etc.), which means both groups end up with the same overall sex and income distribution as the population.
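
To turn that output into the two groups the question asks for, the Selected flag written by outall can be recoded. A minimal sketch, assuming the sample dataset produced above; the names groups and group are made up for illustration:

```sas
/* Selected=1 rows form group 1; the remaining rows form group 2 */
data groups;
  set sample;
  if selected = 1 then group = 1;
  else group = 2;
run;

/* Row percentages give the sex split within each group */
proc freq data=groups;
  tables group*sex / nocol nopercent;
run;

/* Mean income per group, to compare against the population mean */
proc means data=groups n mean;
  class group;
  var income;
run;
```

Because the 40% rate is applied within each stratum and stratum sample sizes must be whole numbers, group 1 may not contain exactly 400 rows. Adding a seed= option to the proc surveyselect statement makes the draw reproducible.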

This doesn't work when income is a continuous variable; in that case you can try the control statement instead. The resulting distribution won't match as exactly, but it should still be in the ballpark.
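
A rough sketch of the control approach for continuous income, assuming systematic sampling (method=sys) is acceptable for your purposes. The dataset pop_cont, its simulated income values, and the seed are illustrative assumptions, not part of the original example:

```sas
/* Hypothetical population with a continuous income variable */
data pop_cont;
  call streaminit(11);
  do id = 1 to 1000;
    if rand('Uniform') > 0.6 then sex = 'M';
    else sex = 'F';
    income = round(exp(rand('Normal', 10.5, 0.5)), 0.01);
    output;
  end;
run;

/* STRATA still requires sorting by the strata variable */
proc sort data=pop_cont;
  by sex;
run;

/* CONTROL orders rows by income within each sex stratum; systematic
   selection then spreads the sample across the income range */
proc surveyselect data=pop_cont out=sample_ctl method=sys samprate=0.4 seed=11 outall;
  strata sex;
  control income;
run;

/* Compare mean and spread of income for selected vs. unselected rows */
proc means data=sample_ctl n mean std;
  class selected;
  var income;
run;
```

Here sex is still handled by strata, while income only influences the ordering from which the systematic sample is taken, so the income means are close by design rather than guaranteed equal.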

In terms of selection probability, this does differ from sampling the entire population while controlling for the two variables independently. Here you get 40% of each combined sex-by-income bucket, whereas a sample of the whole population that only matches the overall income and sex groupings might have a lot more of 'Female 3' and fewer of 'Male 3', offset by more of 'Male 2' and fewer of 'Female 2'.
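
One way to see the per-bucket behaviour is a three-way crosstab of the stratified output. A small sketch, assuming the sample dataset from the example above:

```sas
/* One table per sex: rows are income levels, columns are Selected (0/1).
   Row percentages show the share of each sex-by-income cell that was sampled. */
proc freq data=sample;
  tables sex*income*selected / nocol nopercent;
run;
```

With joint stratification, each cell's selected share should sit at or slightly above 40%, depending on how the stratum sample sizes are rounded.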