0
votes

So, I have a significant problem with proc compare. I have two datasets with the two columns. One column lists table names and the other one - names of variables which correspond to table names from the first column. I want compare values of one of them based on the values of first column. I somewhat made it work but the thing is that these datasets have different sizes due to additional values in one of them. Which means that some new variable was added in the middle of a dataset (new variable was added to a table). Unfortunately, proc compare compares values from two datasets horizontally and checks them against each other for values, so in my case it looks like this:

ds 1 | ds 2

cost | box_nr

other | cost_total

As you can see, a new value box_nr was added to the second dataset that appears above the value that I want it to compare variable cost to (cost_total). So I would like to know if it's possible to compare values (check for differences in character sequence) that have at least minimal similarity - for example 3 letters (cos) or if it's possible to just put values like box_nr at the end suggesting that they don't appear in a certain dataset.

My code:

PROC Compare base=USERSPDS.MIzew compare=USERSPDS.MIwew
    out=USERSPDS.result outbase outcomp outdif noprint; 
    id 'TABLE  HD'n;
    where ;
run; 

proc print data=USERSPDS.result noobs;
   by 'TABLE  HD'n;
   id 'TTABLE  HD'n;
   title 'COMPARISON:';
run; 
2
I tried for half an hour to understand your question without success. What if you rewrite it?Dirk Horsten

2 Answers

0
votes

Untested, but this should get you some of the way.

proc sql;
   create table compare as
   select 
      coalesce(a.cola, b.cola) as cola,
      a.colb as acolb,
      b.colb as bcolb
   from dataa as a
   full outer join datab as b
      on 
         a.cola = b.cola and
         compged(a.colb, b.colb) <= 100;
quit;

Have a look at the compged documentation for further information.

0
votes

Sounds like you could make a new variable in both datasets, VAR3chars=substr(var,1,3) and then add that variable to your ID statement. I think that should work unless there are duplicate values.

So if one dataset had var="cost" and the other had var="cost_total", they would match on the id so they would be compared and found to be different.

If one dataset had var="box_nr" and the other did not have any values starting with "box", they would not match on the id so compare would find that a record exists for that id in one dataset but not the other.