1
votes

I have multiple datasets. Each of them has different number of attributes. I want to merge them all by common variable. This is 'union' if I use proc SQL. But there is hunderds of variables. Example.

Dataset_Name    Number of columns   
dataset1           110
dataset2           120
dataset3           130
...                ...

Say they have 100 columns in common. The final dataset which contains all dataset1,dataset2,dataset3..etc only has common columns(in this case, 100 columns).

How do I do this?

And how do I get columns for each dataset this is not in common with the final dataset. example: dataset1 will have 10 columns that are not in the final dataset, and list the name of 10 columns.

Thanks!!!!

1

1 Answers

2
votes

UNION in SQL is equivalent to sequential SET in SAS.

data want;
set dataset1 dataset2 dataset3;
run;

Now, SAS by default includes all columns present in any dataset. To limit to just what's in all datasets, you have to use a keep statement.

You can determine this using proc sql, among other ways.

proc sql;
select name into :commonlist separated by ' '
  from dictionary.columns C, dictionary.columns D
  where C.libname=D.libname
  and C.memname='DATASET1'
  and D.memname='DATASET2'
  and C.name=D.name
;
quit;

For more than two datasets it's more complicated and partially depends on your, but if you're comfortable in SQL you can figure that out pretty easily. A similar construct can create a list of just dataset 1 variables. The important part is the into :commonlist separated by ' ', which says to pull the select results into a macro variable called commonlist, separating rows by space. (The colon says to create a macro variable, not a table.)

So you can then run:

data want (keep=&commonlist.) dset1(keep=&dset1list.) dset2(keep=&dset2list.);
  set dataset1(in=ds1) dataset2(in=ds2) dataset3(in=ds3);
  output want;
  if ds1 then output dset1;
  else if ds2 then output dset2;
  else if ds3 then output dset3;
run;

The in=xyz indicates which dataset a row came from. Each output dataset can have a separate list of variables to keep. You might want to keep the ID variable in those other datasets as well.

I will say that usually in SAS you don't do what you're doing here: it's not easy to do because it doesn't tend to be the best way to handle things - specifically, the little split off datasets. In general you would just keep those extra variables on the master dataset, and they'd just be nulls for anyone not in a dataset with that variable - assuming it makes sense to make this 'master' dataset at all.