I need to consolidate observations based on consecutive validity ranges.
Here is a reproducible minimal example:
data have;
infile datalines4 delimiter=",";
input id $ start :date9. end :date9. var1 var2 var3;
format start end date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
quit;
For each id, if the validity dates are consecutive (i.e. end_date + 1day
of the n-1
observation is equal to the start_date
of the n
observation) and if they share the same value for all three variables (var1, var2 and var3), I want to consolidate those multiple observations into a single one.
What I would like to have:
id | start | end | var1 | var2 | var3 |
---|---|---|---|---|---|
ABC1 | 01DEC2020 | 10MAR2021 | 1 | 0 | 1 |
ABC1 | 11MAR2021 | 31DEC9999 | 1 | 1 | 1 |
DEF2 | 11DEC2020 | 10JAN2021 | 0 | 0 | 0 |
DEF2 | 11JAN2021 | 02FEB2021 | 1 | 1 | 1 |
DEF3 | 03FEB2021 | 10MAR2025 | 1 | 0 | 1 |
Any kind of approach (array/hash/sql) would be appreciated !
EDIT: What I tried
proc sql;
select name into :var separated by "," from sashelp.vcolumn
where libname = "WORK"
and memname = "HAVE"
and name not in ("id","start","end");
quit;
data stage1;
length conc $2000.;
set have;
conc = catx("",&var.);
run;
data stage2;
set stage1;
by id conc notsorted;
if first.conc then
group_number+1;
run;
proc sort data=stage2 out=stage3;
by id conc group_number start;
run;
data stage4;
do until (last.conc);
set stage3;
by id conc group_number;
if first.group_number then
_start = start;
if last.conc then
do;
start = _start;
OUTPUT;
end;
end;
drop _start;
run;
proc sort data=stage4 out=want(drop=group_number);
by id start;
run;
It creates the correct output, however there must be a better way of doing this by not concatenating all the variables...