1
votes

How can I create a subset of a dataset in SAS where I want to determine the mean, variance, and number of observations for a variable before hand?

Example:

OBS  NAME  x1 x2
1    Bill  3  2
2    James 4  5
3    Rick  5  6
4    Bob   3  7
5    Clas  5  2
6    Brye  2  9
7    Mann  8  5
8    Pids  4  8
9    Tony  0  7
10   Lou   2  6 

Say I want a subset with 3 observations, mean(x1) = 4, and std(x1) = 0.95. How would I create this subset in SAS?

I would rather not do something using proc means and guess/check repeatedly. Any help is appreciated!

Update: Created a logistic model predicting whether the observations would be in the treatment or control group. Then took the top 10% with the highest probability of being in the treatment group, but were not included in the treatment group and essentially used as the control group.

1
Are you generally picking a control group that's close to your population mean/std (or, your larger sample mean/std)? Or are you picking one that is intentionally biased by a significant amount?Joe
Also, dv: for a SAS question this is pretty well detailed and a good question. It doesn't contain actual code, but it contains a good explanation of other methods that are not working. I consider this a good question as it stands.Joe
Joe, I'd like to pick a control group that is representative of my larger sample based on mean/std of some variables. Essentially, if my larger sample has a mean of 4 and std of 0.95 I'd like to chose a sample with those given measures.GoCubs
Not sure there is much better solution than proc surveyselect and repeat if you get too far off. You can use control sorting to make it a bit more likely that it's closer, but I'm not sure that would be appropriate here. But perhaps someone else will know of one.Joe
I'm adding the sas-iml tag since I think that might be useful for solving this, particularly if Rick has a good solution. If you don't have SAS-IML we can remove the tag of course.Joe

1 Answers

0
votes

This is a variation of the knapsack problem. You are trying to find a subset of objects (here, 3 people), such that their properties come close to some specified target values (here, the sum [or mean] and corrected sum of squares [or std dev]). This is also known as a moment-matching problem.

As stated, the problem is not well defined. You need to specify an objective function to minimize. For example, you might choose the function (mean-target_mean)**2 + (stdDev - target_stdDev)**2 where (mean, stdDev) are the moments for each sampe of size 3.

For small sets (as in your example), you can do a complete enumeration of the "N choose 3" combinations to figure out which combination to choose. See the article "Generate combinations in SAS" for tips. For example, in SAS/IML, you can solve the stated problem as follows:

data A;
length NAME $5.;
input NAME $ x1 x2;
datalines;
Bill  3  2
James 4  5
Rick  5  6
Bob   3  7
Clas  5  2
Brye  2  9
Mann  8  5
Pids  4  8
Tony  0  7
Lou   2  6 
;

proc iml;
use A; read all var {Name x1}; close;

N = nrow(x1);              /* number of obs */          
k = 3;                     /* size of subset */
targetMean = 4;
targetStd = 0.95;

idx = allcomb(N, k);       /* all M='N choose 3' combinations */
X = shape( x1[idx], nrow(idx) );
mean = mean(X`);           /* 1 x M vector of sample means */
std  = std(X`);            /* 1 x M vector of sample std devs */
objective = (mean - targetMean)##2 + (std - targetStd)##2;
minVal = objective[><];    /* minimize objective */
minIdx = objective[>:<];   /* a sample that achieves minimum */

sampNames = Name[idx[minIdx,]];
sampVals = x1[idx[minIdx,]];
print sampVals[rowname=sampNames];

Of course, there might be more than one solution. This example has 8 solutions.

For problems in which there are N items and you want a subset of size k, and for which 'N choose k' is prohibitively large, you can generate random subsets by using the RANCOMB function (or PROC SURVEYSELECT, as someone mentioned). Alternately, you can formulate this problem as an optimization problem. You can use algorithms in SAS/OR or SAS/IML to solve it. For moderate-sized subsets, you could use genetic algorithms in SAS/IML, which are useful for knapsack-like problems.