1
votes

I have a dataset with each observation having two space-separated lists as string variables. I want a third variable showing the overlap between the string lists. Using another SO post, I've created a macro to calculate the overlap. I can't work out how to implement it in a DATA step to get the third variable.

This is my dataset, with dummy data:

data use;
    infile datalines dlm='~~';
    input list1:$100. list2:$100. expected_match:$10.;
datalines;
Homer Bart~~Homer Bart~~Full
Marge Lisa~~Lisa Marge~~Full
Homer Marge~~Marge~~Partial
Bart Lisa~~Bart~~Partial
Homer Marge Bart Lisa~~Maggie~~None
;;;;
run;

This is the macro, with tests (all of which pass):

%macro list_overlap(list1, list2);
%local i matches match_type;
%let matches = 0;
%do i = 1 %to %sysfunc(countw(&list1, %str( )));
    %if %sysfunc(findw(&list2, %scan(&list1, &i,, s)))
        %then %let matches = %eval(&matches + 1);
    %end;
    %if &matches = %sysfunc(countw(&list1, %str( )))
        and %sysfunc(countw(&list1, %str( ))) = %sysfunc(countw(&list2, %str( )))
        %then %let match_type = 'Full';
    %else %if &matches = 0 %then %let match_type = 'None';
    %else %let match_type = 'Partial';
match_type = &match_type%str(;)
%mend list_overlap;
%put NOTE: %list_overlap(Homer Bart,Homer Bart);
%put NOTE: %list_overlap(Marge Lisa,Lisa Marge);
%put NOTE: %list_overlap(Homer Marge,Marge);
%put NOTE: %list_overlap(Bart Lisa,Bart);
%put NOTE: %list_overlap(Homer Marge Bart List,Maggie);

This is how I'm trying to implement it in a DATA step:

data matches;
    set use;
    call execute(catt('%list_overlap(', list1, ',', list2, ')'));
run;

I'm getting the following error with this case:

NOTE: Line generated by the CALL EXECUTE routine.
1         + match_type = 'Full';
            __________
            180

ERROR 180-322: Statement is not valid or it is used out of proper order.

I've tried other ways too, but this is the closest I've got.

2
Why do you repeat the same character twice in the DLM= option on the INFILE statement? Did you mean to use the DLMSTR= option instead?Tom

2 Answers

2
votes

Looks like you want the RESOLVE() function instead of CALL EXECUTE.

data matches;
  set use;
  match_type = resolve(cats('%list_overlap(',list1,',',list2,')'));
run;

But with your current definition of the macro that will include all of characters the macro generates, match_type = 'Full';, into the value of the MATCH_TYPE variable. So remove the superfluous characters the macro is currently generating so that it only generates the value you want to save.

... %then Full;
%else %if matches eq 0 %then None;
%else Partial;
1
votes

Your problem here is that call execute isn't doing what you think, I suspect.

What's happening:

  • Data step runs, call execute lines generated
  • Then the macro stuff is call executed, and you have:

Code example:

data matches;
  set use;
  call execute(stuff);
run;

match_type = 'Full';

That's not legal - that's a data step line but not in a data step.

Instead of doing all of that work in macro land, do it in data step land. Works just as well, and gets done what you want.

Something like this:

%macro list_overlap(list1,list2);
  matches=0;
  length match_type $7;
  do i = 1 to countw(&list1,' ');
    if findw(&list2,scan(&list1,i,' '))
      then matches = matches + 1;
  end;
  if matches eq countw(&list1,' ') then match_type = 'Full';
  else if matches eq 0 then match_type = 'None';
  else match_type = 'Partial';
%mend list_overlap;

Something like that, I can't test it right now, but that should generally work. Then don't call execute the macro, just call it normally.

data matches;
  set use;
  %list_overlap(list1,list2);
run;