How to summarize different combinations of diseases for different age groups?

Question

We have a study of people who have been enrolled to a study at different time points and from different age groups. They have been followed up for two decades and during this time they have developed 1-5 diseases. The diseases are developing at different time points. Here is the code for an example data in SAS:

proc format;
  value agegrp
    30-39 = '30-39'
    40-49 = '40-49'
    50-59 = '50-59'
    60-69 = '60-69'
    70-79 = '70-79'
  ;
  invalue agegrp
    '30-39' = 30
    '40-49' = 40
    '50-59' = 50
    '60-69' = 60
    '70-79' = 70
  ;
run;

* generate some sample data;
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;


data have;
  call streaminit(123);
 
  do id = 1 to 10000;
    enrolled = '01jan2000'd + (1 + floor((1+3650-1)*rand("uniform")));
    age = 30 + %RandBetween(0, 49);

    flag1 = rand('uniform') < 0.25;
    date1 = enrolled + %RandBetween(0, 2500);

    flag2 = rand('uniform') < 0.25;
    date2 = date1 + %RandBetween(0,2500);

    flag3 = rand('uniform') < 0.25;
    date3 = date2 + %RandBetween(0,2500);

    flag4 = rand('uniform') < 0.25;
    date4 = date3 + %RandBetween(0,2500);

    flag5 = rand('uniform') < 0.25;
    date5 = date4 + %RandBetween(0,2500);
    output;
  end;
 format enrolled date: yymmdd10. flag: 1.;
run;

I have summarized the proportion of people with different combinations of disease for their age at the baseline. But now I want to find the number of people having different combinations of diseases at each age group. e.g.to count the number of people who at the age of 40-49 years had disease1+disease2, etc. And the proportion would be the proportion they represent of all individuals while at that age.

The output should look as follows:

Disease combination           30-39  40-49  50-59  60-69  70-79
------------------------------------------------------------------
Combinations of length 2       xx%    yy%  ...
flag1+flag2
flag2+flag3
...


length 3

length 4

length 5

Do you have any thoughts how could one do this?

Richard Richard · Accepted Answer · 2020-10-14T01:40:06

The data is somewhat unusual from a diagnoses standpoint, however, if the flags are for diagnosis of a disease or disease family that follows some temporal progression model the data might make sense.

Considerations

The age at disease flag date needs to be separately computed for each flag.
Double pivoting creates wide structure with flags segregated by at_age grouping
TABULATE has built-in percentage calculations.
The mapping of a 'disease flag asserts condition is present' state to it's corresponding disease name is effectuated using a custom format

Example:

Consider the flagging of 5 diagnoses related to the dreaded Werewolf progression.

proc format;
  value agegrp
    30-39 = '30-39'
    40-49 = '40-49'
    50-59 = '50-59'
    60-69 = '60-69'
    70-79 = '70-79'
    80-high = '80 +'
  ;
  invalue agegrp
    '30-39' = 30
    '40-49' = 40
    '50-59' = 50
    '60-69' = 60
    '70-79' = 70
  ;

  * flag1 to flag5 are progression of Werewolf!;
  value $flag_state_to_disease
    flag1_1='Animal Bite'      
    flag2_1='Hallucination'    
    flag3_1='Onychogryphosis'  
    flag4_1='Hypertrichosis'   
    flag5_1='Hyperdontia'
    other=' '
  ;
run;

* generate some sample data;
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;


data have;
  call streaminit(123);
 
  do id = 1 to 10000;
    enrolled = '01jan2000'd + (1 + floor((1+3650-1)*rand("uniform")));
    age_at_enroll = 30 + %RandBetween(0, 49);

    flag1 = rand('uniform') < 0.25;              
    date1 = enrolled + %RandBetween(0, 2500);

    flag2 = rand('uniform') < 0.25;
    date2 = date1 + %RandBetween(0,2500);

    flag3 = rand('uniform') < 0.25;                
    date3 = date2 + %RandBetween(0,2500);

    flag4 = rand('uniform') < 0.25;
    date4 = date3 + %RandBetween(0,2500);

    flag5 = rand('uniform') < 0.25;
    date5 = date4 + %RandBetween(0,2500);
    output;
  end;

  * force a 5 disease situation for each age group;
  enrolled = '01jan2000'd;
  do age_at_enroll = 30 to 70 by 10;
    flag1=1; flag2=1; flag3=1; flag4=1; flag5=1; 
    date1=enrolled+10; date2=date1+10; date3=date2+10; date4=date3+10; date5=date4+10;
    output;
    id + 1;
  end;

  format enrolled date: yymmdd10. flag: 1.;
run;

* pivot to tall structure;
data tall(keep=id at_age disease);
  set have;
  array dates date1-date5;
  array flags flag1-flag5;

  * row wise transposition of flags as disease names and computed at_age;
  do _n_ = 1 to dim(dates);
    at_age = age_at_enroll + intck('year', enrolled, dates(_n_));
    flag_state = catx('_', vname(flags(_n_)), flags(_n_));
    disease = put(flag_state, flag_state_to_disease.);
    output;
  end;
run;

* pivot back to wide structure, segregating within id the at_age groups;
proc transpose data=tall out=wide1 (label='diseases per id agegroup') prefix=disease;
  by id at_age;
  var disease;
  format at_age agegrp. ;
run;

* computed values for tabulation;
data wide2(keep=at_age disease_count disease_list);
  set wide1;
  disease_count = 5 - cmiss(of disease1-disease5);
  length disease_list $100;
  disease_list = coalescec (catx(', ', of disease1-disease5), '* NONE *');
run;

ods html file='tabulation.html' style=plateau;
title;

proc tabulate data=wide2;
  class disease_count disease_list at_age;
  table 
    disease_count*disease_list
    ,
    at_age * (n*f=comma9. colpctn)
    /
    nocellmerge
  ;
run;
ods html close;

HTML output image

How to summarize different combinations of diseases for different age groups?

2 Answers