0
votes

I have a SAS dataset that looks like this:

id | dept | ...
1    A
2    A
3    A
4    A
5    A
6    A
7    A
8    A
9    B
10   B
11   B
12   B
13   B

Each observation represents a person.

I would like to split the dataset into "team" datasets, each dataset can have a maximum of 3 observations.

For the example above this would mean creating 3 datasets for dept A (2 of these datasets would contain 3 observations and the third dataset would contain 2 observations). And 2 datasets for dept B (1 containing 3 observations and the other containing 2 observations).

Like so:

First dataset (deptA1):

id | dept | ...
1    A
2    A
3    A

Second dataset (deptA2)

id | dept | ...
4    A
5    A
6    A

Third dataset (deptA3)

id | dept | ...
7    A
8    A

Fourth dataset (deptB1)

id | dept | ...
9    B
10   B
11   B

Fifth dataset (deptB2)

id | dept | ...
12   B
13   B

The full dataset I'm using contains thousands of observations with over 50 depts. I can work out how many datasets per dept are required and I think a macro is the best way to go as the number of datasets required is dynamic. But I can't figure out the logic to create the datasets so that they have have a maximum of 3 observations. Any help appreciated.

2
Why do you want to split your dataset up like this? In general, it is bad practice to maintain your data as many datasets - it's very hard to work with, you have to run everything 50 times, and any change you make has to be made accurately to all 50 datasets. If you want to do some analysis by team, SAS has a very powerful concept called the "BY" statement - just create a new variable, assign the team value to it, and then any analysis you do can be done BY team; , and will act as if you had 50 different datasets - just all in one physical dataset. See goo.gl/tfwcr for one example.Joe

2 Answers

1
votes

Another version. Compared to DavB version, it only processes input data once and splits it into several tables in single datastep. Also if more complex splitting rule is required, it can be implemented in datastep view WORK.SOURCE_PREP.

data WORK.SOURCE;
infile cards;
length ID 8 dept $1;
input ID dept;
cards;
1    A
2    A
3    A
4    A
5    A
6    A
7    A
8    A
9    B
10   B
11   B
12   B
13   B
14   C
15   C
16   C
17   C
18   C
19   C
20   C
;
run;

proc sort data=WORK.SOURCE;
by dept ID;
run;

data  WORK.SOURCE_PREP / view=WORK.SOURCE_PREP;
set WORK.SOURCE;
by dept;
length table_name $32;

if first.dept then do;
    count = 1;
    table = 1;
end;
else count + 1;

if count > 3 then do;
    count = 1;
    table + 1;
end;
/* variable TABLE_NAME to hold table name */
TABLE_NAME = catt('WORK.', dept, put(table, 3. -L));
run;

/* prepare list of tables */
proc sql noprint;
create table table_list as
select distinct TABLE_NAME from WORK.SOURCE_PREP where not missing(table_name)
;
%let table_cnt=&sqlobs;
select table_name into :table_list separated by ' ' from table_list;
select table_name into :tab1 - :tab&table_cnt from table_list;
quit;

%put &table_list;

%macro loop_when(cnt, var);
    %do i=1 %to &cnt;
        when ("&&&var.&i") output &&&var.&i;
    %end;
%mend;

data &table_list;
set WORK.SOURCE_PREP;
    select (TABLE_NAME);
        /* generate OUTPUT statements */
        %loop_when(&table_cnt, tab)
    end;
run;
1
votes

You could try this:

%macro split(inds=,maxobs=);

  proc sql noprint;
    select distinct dept into :dept1-:dept9999
    from &inds.
    order by dept;
    select ceil(count(*)/&maxobs.) into :numds1-:numds9999
    from &inds.
    group by dept
    order by dept;
  quit;
  %let numdept=&sqlobs;

  data %do i=1 %to &numdept.;
         %do j=1 %to &&numds&i;
           dept&&dept&i&&j.
         %end;
       %end;;
    set &inds.;
    by dept;
    if first.dept then counter=0;
    counter+1;
    %do i=1 %to &numdept.;
      %if &i.=1 %then %do;
        if
      %end;
      %else %do;
        else if
      %end;
                dept="&&dept&i" then do;
      %do k=1 %to &&numds&i.;
        %if &k.=1 %then %do;
          if
        %end;
        %else %do;
          else if
        %end;
                 counter<=&maxobs.*&k. then output dept&&dept&i&&k.;
      %end;
      end;
    %end;
  run;
%mend split;

%split(inds=YOUR_DATASET,maxobs=3);

Just replace the INDS parameter value in the %SPLIT macro call to the name of your input data set.