0
votes

I have two data sets. Their subject is the same group, but different "samples" are drawn by different researchers. Hence,

  • I cannot identify matches (i.e. Peter Smith in both samples),

  • variables are named differently though they mean the same,

  • variables are in different places of the data sets

Now I would like to compare the two data sets to see how similar they are as a whole. I thought about comparing frequency outputs for variables of which I know they have the same meaning.

So far my solution is rather cumbersome in my eyes, because I would have to do all the steps for each combination of variables.

  1. Is there a more elegant / efficient solution out there?

  2. I would also like to compare mean, median, percentiles for numeric variables.

Thank you very much in advance!

Gerit

Here is my current solution for an example. I named the observations all differently as I cannot, in my original file, tell from the ID variable which observation is which.

data have1;                                                   
    input Name $ road means goal ;                           
    datalines;                                                 
adam 1 3 0                                                   
bob 1 1 1                                                      
clint 2 2 0                                                     
dean 3 1 1                                                     
eric 2 1 0 
flint 1 2 1                                                  
gerald 3 1 1                                                 
;                                                            
run;                                                         
data have2;                                                  
    input Name $ street finish other purpose ;               
    datalines;                                               
harry  2 1 0 3                                               
idefix 1 0 0 2                                               
john   3 1 1 2                                               
kelvin 1 0 2 2                                               
liam   2 1 2 1                                               
max    3 1 2 1                                               
nero   2 0 1 3                                               
ovid   3 0 2 3                                               
;                                                            
run;                                                         

proc freq data=have1;                                        
     tables road / out= fhave1road                           
            (rename=(percent=pct1 count=count1));              
quit;                                                        
proc freq data=have2;                                        
     tables street / out= fhave2street                       
            (rename=(street = road percent=pct2 count=count2));
quit;                                                        

data comb;                                                   
     merge fhave1road fhave2street;                          
     by road;                                                
diffpct = pct1 - pct2;                                       
diffct = count1 - count2;                                    
run;                                                         

proc print data = comb;                                      
     var road count1 count2 diffct pct1 pct2 diffpct;        
quit;                                                        

proc means data=have1;
     var road;
     output out=mhave1road ;
quit;
proc sort data=mhave1road;
     by _STAT_;
quit;
proc means data=have2;
     var street;
     output out=mhave2street ;
quit;
proc sort data=mhave2street;
     by _STAT_;
quit;
data mcomb  (keep=_STAT_ road street diff) ; 
     merge mhave1road mhave2street;     
     by _STAT_; 
diff = road-street; 
run;
proc print data = mcomb; 
quit;              
1
Do not merge unrelated datasets! Fortunate for you they have the same number of observations and your statistics make sense, but the dataset you create is crap.Dirk Horsten

1 Answers

0
votes

You will get quite some information from

title 'have1';
proc means data=have1 min p25 mean median p75 max;
run;
title 'have2';
proc means data=have2 min p25 mean median p75 max;
run; 

or a graphical equivalent with box plots:

title 'have1';
proc transpose data=have1 out=trans1;
    by Name;
    var road means goal;
run;
title 'have2';
proc transpose data=have2 out=trans2;
    by Name;
    var street finish other purpose;
run;
title 'both';
data both;
    set trans1 (in=in1) trans2 (in=in2);
    if in1 then source = 1;
    if in2 then source = 2; 
run;
proc sgpanel data=both;
    panelby source;
    vbox col1 / category= _name_;
run;