I have two data sets. Their subject is the same group, but different "samples" are drawn by different researchers. Hence,
I cannot identify matches (i.e. Peter Smith in both samples),
variables are named differently though they mean the same,
- variables are in different places of the data sets
Now I would like to compare the two data sets to see how similar they are as a whole. I thought about comparing frequency outputs for variables of which I know they have the same meaning.
So far my solution is rather cumbersome in my eyes, because I would have to do all the steps for each combination of variables.
Is there a more elegant / efficient solution out there?
I would also like to compare mean, median, percentiles for numeric variables.
Thank you very much in advance!
Gerit
Here is my current solution for an example. I named the observations all differently as I cannot, in my original file, tell from the ID variable which observation is which.
data have1;
input Name $ road means goal ;
datalines;
adam 1 3 0
bob 1 1 1
clint 2 2 0
dean 3 1 1
eric 2 1 0
flint 1 2 1
gerald 3 1 1
;
run;
data have2;
input Name $ street finish other purpose ;
datalines;
harry 2 1 0 3
idefix 1 0 0 2
john 3 1 1 2
kelvin 1 0 2 2
liam 2 1 2 1
max 3 1 2 1
nero 2 0 1 3
ovid 3 0 2 3
;
run;
proc freq data=have1;
tables road / out= fhave1road
(rename=(percent=pct1 count=count1));
quit;
proc freq data=have2;
tables street / out= fhave2street
(rename=(street = road percent=pct2 count=count2));
quit;
data comb;
merge fhave1road fhave2street;
by road;
diffpct = pct1 - pct2;
diffct = count1 - count2;
run;
proc print data = comb;
var road count1 count2 diffct pct1 pct2 diffpct;
quit;
proc means data=have1;
var road;
output out=mhave1road ;
quit;
proc sort data=mhave1road;
by _STAT_;
quit;
proc means data=have2;
var street;
output out=mhave2street ;
quit;
proc sort data=mhave2street;
by _STAT_;
quit;
data mcomb (keep=_STAT_ road street diff) ;
merge mhave1road mhave2street;
by _STAT_;
diff = road-street;
run;
proc print data = mcomb;
quit;