Consolidate data based on validity ranges and variables

Question

I need to consolidate observations based on consecutive validity ranges.

Here is a reproducible minimal example:

data have;
infile datalines4 delimiter=",";
input id $ start :date9. end :date9. var1 var2 var3;
format start end date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
quit;

For each id, if the validity dates are consecutive (i.e. end_date + 1day of the n-1 observation is equal to the start_date of the n observation) and if they share the same value for all three variables (var1, var2 and var3), I want to consolidate those multiple observations into a single one.

What I would like to have:

id	start	end	var1	var2	var3
ABC1	01DEC2020	10MAR2021	1	0	1
ABC1	11MAR2021	31DEC9999	1	1	1
DEF2	11DEC2020	10JAN2021	0	0	0
DEF2	11JAN2021	02FEB2021	1	1	1
DEF3	03FEB2021	10MAR2025	1	0	1

Any kind of approach (array/hash/sql) would be appreciated !

EDIT: What I tried

proc sql;
    select name into :var separated by "," from sashelp.vcolumn
        where libname = "WORK" 
            and memname = "HAVE" 
            and name not in ("id","start","end");
quit;

data stage1;
    length conc $2000.;
    set have;
    conc = catx("",&var.);
run;

data stage2;
    set stage1;
    by id conc notsorted;

    if first.conc then
        group_number+1;
run;

proc sort data=stage2 out=stage3;
    by id conc group_number start;
run;

data stage4;
    do until (last.conc);
        set stage3;
        by id conc group_number;

        if first.group_number then
            _start = start;

        if last.conc then
            do;
                start = _start;
                OUTPUT;
            end;
    end;

    drop _start;
run;

proc sort data=stage4 out=want(drop=group_number);
    by id start;
run;

It creates the correct output, however there must be a better way of doing this by not concatenating all the variables...

Look through our older questions, this is not the first time I've seen this for sure. At least, what's your idea for how to solve this? Beginning code to do that? — Joe
@Joe Edited my question with what I tried. I am basically trying to find a better way of doing this. — Kermit

Tom Tom · Accepted Answer · 2021-04-23T18:03:44

Use the NOTSORTED feature of the BY statement to detect the beginning of new VAR1-VAR3 combination so you can "group" your observations. Data needs to be sorted by ID START END. You could use a data step view to save disk space if your data is large.

data groups / view=groups;
  set have ;
  by id var1-var3 notsorted ;
  lag_end=lag(end);
  if first.id then group=1;
  else if lag_end+1 ne start or first.var3 then group+1;
run;

Then collapse the groups to one observation.

data want;
  set groups ;
  by id group;
  if first.group then lag_start=start;
  retain lag_start;
  if last.group;
  start=lag_start;
  drop lag_: ;
run;

Result:

Obs     id         start          end    var1    var2    var3    group

 1     ABC1    01DEC2020    10MAR2021      1       0       1       1
 2     ABC1    11MAR2021    31DEC9999      1       1       1       2
 3     DEF2    11DEC2020    10JAN2021      0       0       0       1
 4     DEF2    11JAN2021    02FEB2021      1       1       1       2
 5     DEF2    03FEB2021    10MAR2025      1       0       1       3

To make it dynamic use the same method of getting the list of BY variables only use space delimited list instead of commas.

proc sql noprint;
   select name into :varlist separated by ' ' 
      from dictionary.columns 
      where libname = 'WORK' and memname='HAVE'
        and upcase(name) not in ('ID' 'START' 'END')
   ;
end;

Then change these two lines:

by id &varlist notsorted;
...
... first.%scan(&varlist,-1,%str( )) ...

Consolidate data based on validity ranges and variables

2 Answers