0
votes

I need to consolidate observations based on consecutive validity ranges.

Here is a reproducible minimal example:

data have;
infile datalines4 delimiter=",";
input id $ start :date9. end :date9. var1 var2 var3;
format start end date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
quit;

For each id, if the validity dates are consecutive (i.e. end_date + 1day of the n-1 observation is equal to the start_date of the n observation) and if they share the same value for all three variables (var1, var2 and var3), I want to consolidate those multiple observations into a single one.

What I would like to have:

id start end var1 var2 var3
ABC1 01DEC2020 10MAR2021 1 0 1
ABC1 11MAR2021 31DEC9999 1 1 1
DEF2 11DEC2020 10JAN2021 0 0 0
DEF2 11JAN2021 02FEB2021 1 1 1
DEF3 03FEB2021 10MAR2025 1 0 1

Any kind of approach (array/hash/sql) would be appreciated !

EDIT: What I tried

proc sql;
    select name into :var separated by "," from sashelp.vcolumn
        where libname = "WORK" 
            and memname = "HAVE" 
            and name not in ("id","start","end");
quit;

data stage1;
    length conc $2000.;
    set have;
    conc = catx("",&var.);
run;

data stage2;
    set stage1;
    by id conc notsorted;

    if first.conc then
        group_number+1;
run;

proc sort data=stage2 out=stage3;
    by id conc group_number start;
run;

data stage4;
    do until (last.conc);
        set stage3;
        by id conc group_number;

        if first.group_number then
            _start = start;

        if last.conc then
            do;
                start = _start;
                OUTPUT;
            end;
    end;

    drop _start;
run;

proc sort data=stage4 out=want(drop=group_number);
    by id start;
run;

It creates the correct output, however there must be a better way of doing this by not concatenating all the variables...

2
Look through our older questions, this is not the first time I've seen this for sure. At least, what's your idea for how to solve this? Beginning code to do that?Joe
@Joe Edited my question with what I tried. I am basically trying to find a better way of doing this.Kermit

2 Answers

2
votes

Use the NOTSORTED feature of the BY statement to detect the beginning of new VAR1-VAR3 combination so you can "group" your observations. Data needs to be sorted by ID START END. You could use a data step view to save disk space if your data is large.

data groups / view=groups;
  set have ;
  by id var1-var3 notsorted ;
  lag_end=lag(end);
  if first.id then group=1;
  else if lag_end+1 ne start or first.var3 then group+1;
run;

Then collapse the groups to one observation.

data want;
  set groups ;
  by id group;
  if first.group then lag_start=start;
  retain lag_start;
  if last.group;
  start=lag_start;
  drop lag_: ;
run;

Result:

Obs     id         start          end    var1    var2    var3    group

 1     ABC1    01DEC2020    10MAR2021      1       0       1       1
 2     ABC1    11MAR2021    31DEC9999      1       1       1       2
 3     DEF2    11DEC2020    10JAN2021      0       0       0       1
 4     DEF2    11JAN2021    02FEB2021      1       1       1       2
 5     DEF2    03FEB2021    10MAR2025      1       0       1       3

To make it dynamic use the same method of getting the list of BY variables only use space delimited list instead of commas.

proc sql noprint;
   select name into :varlist separated by ' ' 
      from dictionary.columns 
      where libname = 'WORK' and memname='HAVE'
        and upcase(name) not in ('ID' 'START' 'END')
   ;
end;

Then change these two lines:

by id &varlist notsorted;
...
... first.%scan(&varlist,-1,%str( )) ...
1
votes

Your approach isn't a bad one, to be honest, though it's obviously a bit extra compared to what you probably need. But realistically, if the caveats with that method don't apply, I don't think you should necessarily change approaches. The caveats are that the concatenation could result in incorrect data in some cases - if the variables can have missing values, in particular, though you can work around that with CATQ.

Don't use CATX with space (and "" is space, despite its appearances otherwise, SAS has no "null" outside of some special case like TRIMN - you can do that, though, catx(trimn(' '),...)). Use CATX with a delimiter unlikely to appear in the data, like | or ~. And like I said, maybe use CATQ which lets you add the m modifier which leads missings to not be ignored.

If you need a higher performant solution, it's possible to do this with the hash object, but frankly I don't recommend it - the code is much, much messier (you have to use a hash iterator, and a bunch more code). A keyed modify might also work, but again it will tend to be messier. If I were to do this, I might do the hash object solution, but most likely I'd either do your solution, or the non-concatenating version of your solution (where we do it like this):

proc sql;
    select name into :var separated by " " from sashelp.vcolumn
        where libname = "WORK" 
            and memname = "HAVE" 
            and name not in ("id","start","end");
quit;

proc sort data=have out=stage1;
    by id start end;
run;
data stage2;
    set have;
    by id &var. notsorted;
    lagend = lag(end) + 1;
    if first.id then call missing(lagend);
    format lagend date9.;
    if (first.var3) or (start ne lagend) then
        group_number+1;
run;

proc sort data=stage2 out=stage3;
    by id group_number start;
run;

data stage4;
    do until (last.group_number);
        set stage3;
        by id group_number;

        if first.group_number then
            _start = start;

        if last.group_number then
            do;
                start = _start;
                OUTPUT;
            end;
    end;

    drop _start;
run;

proc sort data=stage4 out=want(drop=group_number);
    by id start;
run;

Not dramatically different, does avoid that big $2000 variable of course, and avoids things like formats causing issues. I don't think you actually checked for start = lag(end)+1, either, so you probably want to do that. Note I do hardcode var3 in one place - I think you can avoid that by using the macro language to find the last word in &var, if you like, or otherwise finding what the last word is. You can also do what one of my coworkers does, which is to add a dummy variable after the variable list, so:

by id &var. dummyvar notsorted;

And then you can use if first.dummyvar. Just requires you to have dummyvar somewhere in there ahead of time (and the same value for all rows). You might even be able to add id a second time in the by statement after &var., but I find that more confusing to read.