2
votes

I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:

/* create table A */
data a;
  infile datalines;
  input id1 $ id2 $ var1;
datalines;
    1 a 10
    1 b 10
    2 a 10
    2 b 10
;
run;

/* create table B */
data b;
  infile datalines;
  input id1 $ var1 var2;
datalines;
    1 20 30
    2 20 30
;
run; 

/* merge A&B to overwrite var1 in table A using values in table B */
data c;
    merge a b;
    by id1;
run;

Table C looks like this:

 ID1 ID2 VAR1 VAR2

 1   a     20   30

 1   b     10   30

 2   a     20   30

 2   b     10   30

Why the 10s in row 2&4 didn't get replaced by 20 from table B? While var2 works as expected?

I know I can do this simply using proc SQL, and that's what I did to solve the problem. But I still quite curious if there is a way to do what I wanted using merge? And why this wasn't working? I prefer merge over SQL in this circumstance because the logic is easier to implement (util I found this not working properly).

I use SAS 9.4.

3
You have duplicate values by id1 in table A. I think there's no simpe approach by using any merge/update/modify to achieve this. - Lovnlust
Don't use code snippets with anything other than the languages they explicitly work with (mostly, html/javascript/etc.). - Joe
Also relevant (but not an exact dup), stackoverflow.com/questions/25251177/… - Joe

3 Answers

2
votes

This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each of A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record. VAR1 gets its value from A (because there is no B).

IF there is record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.

/* merge A&B to overwrite var1 in table A using values in table B */
data c;
    merge a(drop=var1) b;
    by id1;
run;

This drops the VAR1 from A so that it is carried down from the record in B.

Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).

2
votes

Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function pick the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.

proc sql;
create table c as
select
  a.id1,
  a.id2,
  coalesce(b.var1,a.var1) as var1,
  b.var2
from
  a
  left join b
  on a.id1 = b.id1
;
quit;
1
votes

The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:

data c (drop = a_var1 b_var1);
    merge a(rename=(var1 = a_var1))
          b(rename=(var1 = b_var1));
    by id1;
    * Now you have two different variables named a_var1 and b_var1;
    * Implement logic to choose your favorite;
    if NOT MISSING(b_var1) Then DO;
        var1 = b_var1;
        var1_source='B';
    END;
    else DO;
        var1 = a_var1;
        var1_source='A';
    END;
run;

If your criteria for which 'var1' to choose is as simple as 'If b exists, use it' then this is identical to the the SQL method with coalesce().

Where I've found this method useful is for more complicated criteria, plus its always nice to know the source of the data (which doesn't happen with coalesce()).