0
votes

I would like to split my observations in a "parent" dataset into their own unique "child" datasets. I need to do this for several parent datasets, so I am trying to create a macro with a do loop inside to generate these datasets. But my code is not working (perhaps for multiple reasons).

Here is manual code as an example of what I want to automate (this code works fine, the "parent" dataset ta220092 has four observations in this case, but in other "parent" datasets it may be larger or smaller):

data ta2200921 ta2200922 ta2200923 ta2200924;
set ta220092;
if _n_ = 1 then output ta2200921;
if _n_ = 2 then output ta2200922;
if _n_ = 3 then output ta2200923;
if _n_ = 4 then output ta2200924;
run;

In trying to automate this. I thought I should use the automatic "n" variable to add to the dataset name and for the %to statement since the number of observations in each "parent" dataset varies, but I am not sure how to do it. I have created the following code, which has an issue which I am hoping someone can help with:

%macro treatmentsplit(j);
%do i = 1 %to &j.;
&j. = _n_;
data tatest220092&i.;
set ta220092 (where = (_n_ = &i.));
run;
%end;
%mend treatmentsplit;
%treatmentsplit;

Thank you.

Besides editing the above for some clarity, I need to edit my question to address why I don't believe this is a duplicate question as Joe tagged. His proposed duplicate question is What's the fastest way to partition a sas dataset for batch processing?

There are two reasons why I don't think this question is a duplicate. First, the underlying reason for wanting to split is different. For my problem, this is not an issue of trying to break up a large dataset for reasonable batch processing. I will address my underlying reason for wanting to split in the next paragraph. The second reason I don't consider this a duplicate is the code to resolve the "What's the fastest way to partition a SAS dataset for batch processing" does not work for my situation. The two code answers provided specify a number of datasets the parent dataset is to be split into. I do not know in advance the number of splits for each dataset I want to split, since the number of observations vary in each dataset. I tried to modify the second answer (by RWill) for my situation, and have been unsuccessful with that so far. Here is my best attempt to modify the second answer to my situation so far (have tried variants):

%macro nobs(dsn);
%local nobs dsid rc;
%let nobs=0;
%let dsid = %sysfunc(open(&dsn));
%if &dsid %then %do;
  %let nobs = %sysfunc(attrn(&dsid,NOBS));
%end;
%else %put Open for dataset &dsn failed - %sysfunc(sysmsg());
%let rc   = %sysfunc(close(&dsid));
%mend nobs;

%macro batch_process(dsn_in,dsn_out_prefix);
%let dsn_obs = %nobs(&dsn_in);
%let obs_per_dsn = 1;
data
 %do i = 1 %to &dsn_obs;
    &dsn_out_prefix.&i
 %end; ;
 set &dsn_in;
 drop _count;
 retain _count 0;
 _count = _count + 1;
 %do i = 1 %to &dsn_obs;
    if (1 + ((&i - 1) * 1) <= _count <= (&i * 1) then do;
       output &dsn_out_prefix.&i;
    end; 
 %end;
  run;
%mend batch_process;
%batch_process( dsn_in=tmp1.ta220092 , dsn_out_prefix = ta220092);

The error from the log seems to indicate that there is an issue with the DSN_OBS variable in the do loop (5th line down in the second macro):

SYMBOLGEN: Macro variable DSN_OBS resolves to ERROR: %EVAL function has no expression to evaluate, or %IF statement has no condition. ERROR: The %TO value of the %DO I loop is invalid.

To address my underlying reason for wanting to split my dataset to be one observation per dataset, I have modified a macro which almost works the way I need to, with one issue. The original macro I modified is for propensity score matching http://www.biostat.umn.edu/~will/6470stuff/Class25-12/PSmatching.sas. I modify it to address my dataset better (changing variable names), and I also added a method I call "CC" for calculated caliper, because I want to capture all controls which are within 10 or 20% of the matching variable of my treatment group (there will be a second matching variable which is selected by nearest neighbor, but I don't have an issue with the code for that step down the line). The issue is that in a treatment dataset (such as ta220092 above), there are two observations who have matching variables that have overlapping calculated caliper "zones"--one has assets that are 62, and one has 64. The macro has a replacement option; if I select "yes", then I get the treatment matched to the same control 100 times (not what I want, I want all controls within the calculated caliper). If I select "no" for the replacement option, then the macro almost works how I want, but the control observations that are a potential match two treatment observations that have overlapping calculated calipers are split between the two treatment observations, instead of being allowed to be within the caliper of each treatment. So the macro is not allowing replacement at the dataset level, when what I want is for it to not allow replacement at the observation level. Stated another way, I do want there to be replacement between observations, but am not sure how to modify the macro. I thought it would be easier (but granted much less elegant solution) to split each treatment observation into its own data set (I have less than 600 treatments). Here is the macro I have that is functioning, but not doing quite what I want it to. (Since I am new to Stack Overflow, you can kindly point out to me if this edit is TMI, if I should have opened another question, or just given all this information in the original question--I so much appreciate your help and would like to be as little of a burden as possible).

    %macro Matching(datatreatment=, datacontrol=, method=, numberofcontrols=, caliper=, ccpercent=,
     replacement=, out=);

    /* Create copies of the treated units if N > 1 */;
     data _Treatment0(drop= i);
      set &datatreatment;
      do i= 1 to &numberofcontrols;
      RandomNumber= ranuni(12345);
    output;
    end;
    run;
    /* Randomly sort both datasets */
    proc sort data= _Treatment0 out= _Treatment(drop= RandomNumber);
    by RandomNumber;
    run;
    data _Control0;
    set &datacontrol;
    RandomNumber= ranuni(45678);
    run;
    proc sort data= _Control0 out= _Control(drop= RandomNumber);
    by RandomNumber;
    run;

     data Matched (keep = cikSelectedControl atControl roacontrol roatreat fyear  industry MatchedToTreatcik atTreat);
      length atC 8;
      length cikC 8;
      /* Load Control dataset into the hash object */
      if _N_= 1 then do;
    declare hash h(dataset: "_Control", ordered: 'no');
    declare hiter iter('h');
    h.defineKey('cikC');
    h.defineData('roac','atC','cikC');
    h.defineDone();
    call missing(cikC, atC, roac);
    end;
    /* Open the treatment */
    set _Treatment;
    %if %upcase(&method) ~= RADIUS %then %do;
    retain BestDistance 99;
    %end;
    /* Iterate over the hash */
    rc= iter.first();
    if (rc=0) then BestDistance= 99;
    do while (rc = 0);
    /* Caliper */
    %if %upcase(&method) = CALIPER %then %do;
    if (atT - &caliper) <= atC <= (atT + &caliper) then do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    end;
    end;
    %end;
    /* Calculated caliper */
    %if %upcase(&method) = CC %then %do;
    ccdist = &ccpercent*atT;
    if (atT - ccdist) <= atC <= (atT + ccdist) then do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;

    MatchedToTreatcik = cikT;
    atTreat = atT;
    ROAControl = roaC;
    ROATreat=roat;
    end;
    end;
    %end;
    /* NN */
    %if %upcase(&method) = NN %then %do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    end;
    %end;

    %if %upcase(&method) = NN or %upcase(&method) = CALIPER or %upcase(&method) = CC      %then %do;
    rc = iter.next();
    /* Output the best control and remove it */
    if (rc ~= 0) and BestDistance ~=99 then do;
    output;
    %if %upcase(&replacement) = NO %then %do;
    rc1 = h.remove(key: cikSelectedControl);
    %end;
    end;
    %end;
    /* Radius */
    %if %upcase(&method) = RADIUS %then %do;
    if (atT - &caliper) <= atC <= (atT + &caliper) then do;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    output;
    end;
    rc = iter.next();
    %end;
    end;
    run;
    /*to download datasets from wrds to investigate*/
    proc download data=matched; run;
    proc download data=_Control; run;
    /* Delete temporary tables. Quote for debugging */
    proc datasets NOLIST; /*Nolist option should prevent printing of dataset list*/
    delete _:(gennum=all);

    run;
     data &out;
       set Matched;
     run;
    proc datasets NOLIST; /*Nolist option should prevent printing of dataset list*/
    delete Matched;
    %mend Matching;


    %Matching(datatreatment= Ta220092, datacontrol= ca220092, method= cc,
     numberofcontrols= 100, caliper=1, ccpercent=.2, replacement= no, out= matchtest4);

One other note is I will be running this match via PC SAS on the WRDS system, which is faster and won't freeze up my computer during processing.

1
"I would like to split my observations in a "parent" dataset into their own unique "child" datasets." Why? What are you going to do with them next? Can't you just read the parent data set and use filters to get the rows you need?sasfrog
Don't do this. As @sasfrog says, this isn't typically a good idea. If you're trying to get some random samples of your data, use proc surveyselect, it can make plenty of samples (google "Don't be loopy SAS"). Splitting your datasets makes maintenance harder and makes your libraries and code a mess.Joe
Thank you Joe and @Sasfrog for your comments. I realize what I am trying to do is not elegant. I have edited my post to attempt to give better underlying rationale, to explain why the batch processing question doesn't address my issue, and to provide the underlying issue I am trying to solve. Perhaps you can weigh in with the additional information?mrlcpa
Wow, tl;dr. What are you trying to do in a sentence or two?sasfrog
I am trying to match observations by variable1 within a certain % then by variable2 nearest neighbor, and I have a macro that almost works at giving me the pool of controls for the first step--within a % of variable1, but it doesn't allow control observations to be replaced in the control pool for different observations. Two ways to fix--1) put each treatment observation in its own dataset (a quick question, an inelegant solution, this is my original question), or 2) fix the macro (the elegant way to solve, but more difficult).mrlcpa

1 Answers

0
votes

I have improved my understanding of the macro and modified the macro to make it work. It turns out the calculated caliper was basically a nearest neighbor match with a radius restriction. So when I modified the macro to include a calculated radius, then the macro was able to match how I need it to (see above question). Below is the modified macro:

/************************************************ 
matching.sas   adapted from

Paper 185-2007  SAS Global Forum 2007
Local and Global Optimal Propensity Score Matching
Marcelo Coca-Perraillon
Health Care Policy Department, Harvard Medical School, Boston, MA

-------------------------------
Treatment and Control observations must be in separate datasets such that 
Control data includes: cikC =  subject_cik, atC = total assets
Treatment data includes: cikT, atT = total assets
cik must be numeric 

method = NN (nearest neighbor), caliper, or radius, or CC or RC --  CC/RC added by 
MRL calcpercent= percentage to be applied to ccvariable or rcvariable to create 
calculated caliper or calculated radius

caliper value = max for matching

replacement = yes/no  whether controls can be matched to more than one case

out = output data set name

example call:

 %Matching(datatreatment= T, datacontrol= C, method= RC,
  numberofcontrols= 1, caliper=, calcpercent=.20, replacement= no, out= matches);

************************************************/
rsubmit;
%macro Matching(datatreatment=, datacontrol=, method=, numberofcontrols=, caliper=,
 calcpercent=, replacement=, out=);

    /* Create copies of the treated units if N > 1 */;
     data _Treatment0(drop= i);
      set &datatreatment;
      do i= 1 to &numberofcontrols;
      RandomNumber= ranuni(12345);
    output;
    end;
    run;
    /* Randomly sort both datasets */
    proc sort data= _Treatment0 out= _Treatment(drop= RandomNumber);
    by RandomNumber;
    run;
    data _Control0;
    set &datacontrol;
    RandomNumber= ranuni(45678);
    run;
    proc sort data= _Control0 out= _Control(drop= RandomNumber);
    by RandomNumber;
    run;

     data Matched (keep = cikSelectedControl atControl roacontrol roatreat fyear industry MatchedToTreatcik atTreat);
      length atC 8;
      length cikC 8;
      /* Load Control dataset into the hash object */
      if _N_= 1 then do;
    declare hash h(dataset: "_Control", ordered: 'no');
    declare hiter iter('h');
    h.defineKey('cikC');
    h.defineData('roac','atC','cikC');
    h.defineDone();
    call missing(cikC, atC, roac);
    end;
    /* Open the treatment */
    set _Treatment;
    %if %upcase(&method) ~= RADIUS or %upcase(&method) ~= CR %then %do;
    retain BestDistance 99;
    %end;
    /* Iterate over the hash */
    rc= iter.first();
    if (rc=0) then BestDistance= 99;
    do while (rc = 0);
    /* Caliper */
    %if %upcase(&method) = CALIPER %then %do;
    if (atT - &caliper) <= atC <= (atT + &caliper) then do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    end;
    end;
    %end;
    /* Calculated caliper */
    %if %upcase(&method) = CC %then %do;
    ccdist = &calcpercent*atT;
    if (atT - ccdist) <= atC <= (atT + ccdist) then do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;

    MatchedToTreatcik = cikT;
    atTreat = atT;
    ROAControl = roaC;
    ROATreat=roat;
    end;
    end;
    %end;
    /* NN */
    %if %upcase(&method) = NN %then %do;
    ScoreDistance = abs(atT - atC);
    if ScoreDistance < BestDistance then do;
    BestDistance = ScoreDistance;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    end;
    %end;

    %if %upcase(&method) = NN or %upcase(&method) = CALIPER or %upcase(&method) = CC %then %do;
    rc = iter.next();
    /* Output the best control and remove it */
    if (rc ~= 0) and BestDistance ~=99 then do;
    output;
    %if %upcase(&replacement) = NO %then %do;
    rc1 = h.remove(key: cikSelectedControl);
    %end;
    end;
    %end;
    /* Radius */
    %if %upcase(&method) = RADIUS %then %do;
    if (atT - &caliper) <= atC <= (atT + &caliper) then do;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    ROAControl = roaC;
    ROATreat=roat;
    output;
    end;
    rc = iter.next();
    %end;
    /* Calculated Radius */
    %if %upcase(&method) = CR %then %do;
    rcdist = &calcpercent*atT;
    if (atT - rcdist) <= atC <= (atT + rcdist) then do;
    cikSelectedControl = cikC;
    atControl =  atC;
    MatchedToTreatcik = cikT;
    atTreat = atT;
    ROAControl = roaC;
    ROATreat=roat;
    output;
    end;
    rc = iter.next();
    %end;
    end;
    run;
    /*for when testing and  using wrds
    proc download data=matched; run;
    proc download data=_Control; run;*/
    /* Delete temporary tables. Quote for debugging */
    proc datasets NOLIST; /*Nolist option should prevent printing of dataset list*/
    delete _:(gennum=all);

    run;
     data &out;
       set Matched;
     run;
    proc datasets NOLIST; /*Nolist option should prevent printing of dataset list*/
    delete Matched;
    %mend Matching;