1
votes

We have a study of people who have been enrolled to a study at different time points and from different age groups. They have been followed up for two decades and during this time they have developed 1-5 diseases. The diseases are developing at different time points. Here is the code for an example data in SAS:

proc format;
  value agegrp
    30-39 = '30-39'
    40-49 = '40-49'
    50-59 = '50-59'
    60-69 = '60-69'
    70-79 = '70-79'
  ;
  invalue agegrp
    '30-39' = 30
    '40-49' = 40
    '50-59' = 50
    '60-69' = 60
    '70-79' = 70
  ;
run;

* generate some sample data;
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;


data have;
  call streaminit(123);
 
  do id = 1 to 10000;
    enrolled = '01jan2000'd + (1 + floor((1+3650-1)*rand("uniform")));
    age = 30 + %RandBetween(0, 49);

    flag1 = rand('uniform') < 0.25;
    date1 = enrolled + %RandBetween(0, 2500);

    flag2 = rand('uniform') < 0.25;
    date2 = date1 + %RandBetween(0,2500);

    flag3 = rand('uniform') < 0.25;
    date3 = date2 + %RandBetween(0,2500);

    flag4 = rand('uniform') < 0.25;
    date4 = date3 + %RandBetween(0,2500);

    flag5 = rand('uniform') < 0.25;
    date5 = date4 + %RandBetween(0,2500);
    output;
  end;
 format enrolled date: yymmdd10. flag: 1.;
run;

I have summarized the proportion of people with different combinations of disease for their age at the baseline. But now I want to find the number of people having different combinations of diseases at each age group. e.g.to count the number of people who at the age of 40-49 years had disease1+disease2, etc. And the proportion would be the proportion they represent of all individuals while at that age.

The output should look as follows:

Disease combination           30-39  40-49  50-59  60-69  70-79
------------------------------------------------------------------
Combinations of length 2       xx%    yy%  ...
flag1+flag2
flag2+flag3
...


length 3

length 4

length 5

Do you have any thoughts how could one do this?

2

2 Answers

1
votes

The data is somewhat unusual from a diagnoses standpoint, however, if the flags are for diagnosis of a disease or disease family that follows some temporal progression model the data might make sense.

Considerations

  • The age at disease flag date needs to be separately computed for each flag.
  • Double pivoting creates wide structure with flags segregated by at_age grouping
  • TABULATE has built-in percentage calculations.
  • The mapping of a 'disease flag asserts condition is present' state to it's corresponding disease name is effectuated using a custom format

Example:

Consider the flagging of 5 diagnoses related to the dreaded Werewolf progression.

proc format;
  value agegrp
    30-39 = '30-39'
    40-49 = '40-49'
    50-59 = '50-59'
    60-69 = '60-69'
    70-79 = '70-79'
    80-high = '80 +'
  ;
  invalue agegrp
    '30-39' = 30
    '40-49' = 40
    '50-59' = 50
    '60-69' = 60
    '70-79' = 70
  ;

  * flag1 to flag5 are progression of Werewolf!;
  value $flag_state_to_disease
    flag1_1='Animal Bite'      
    flag2_1='Hallucination'    
    flag3_1='Onychogryphosis'  
    flag4_1='Hypertrichosis'   
    flag5_1='Hyperdontia'
    other=' '
  ;
run;

* generate some sample data;
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;


data have;
  call streaminit(123);
 
  do id = 1 to 10000;
    enrolled = '01jan2000'd + (1 + floor((1+3650-1)*rand("uniform")));
    age_at_enroll = 30 + %RandBetween(0, 49);

    flag1 = rand('uniform') < 0.25;              
    date1 = enrolled + %RandBetween(0, 2500);

    flag2 = rand('uniform') < 0.25;
    date2 = date1 + %RandBetween(0,2500);

    flag3 = rand('uniform') < 0.25;                
    date3 = date2 + %RandBetween(0,2500);

    flag4 = rand('uniform') < 0.25;
    date4 = date3 + %RandBetween(0,2500);

    flag5 = rand('uniform') < 0.25;
    date5 = date4 + %RandBetween(0,2500);
    output;
  end;

  * force a 5 disease situation for each age group;
  enrolled = '01jan2000'd;
  do age_at_enroll = 30 to 70 by 10;
    flag1=1; flag2=1; flag3=1; flag4=1; flag5=1; 
    date1=enrolled+10; date2=date1+10; date3=date2+10; date4=date3+10; date5=date4+10;
    output;
    id + 1;
  end;

  format enrolled date: yymmdd10. flag: 1.;
run;

* pivot to tall structure;
data tall(keep=id at_age disease);
  set have;
  array dates date1-date5;
  array flags flag1-flag5;

  * row wise transposition of flags as disease names and computed at_age;
  do _n_ = 1 to dim(dates);
    at_age = age_at_enroll + intck('year', enrolled, dates(_n_));
    flag_state = catx('_', vname(flags(_n_)), flags(_n_));
    disease = put(flag_state, flag_state_to_disease.);
    output;
  end;
run;

* pivot back to wide structure, segregating within id the at_age groups;
proc transpose data=tall out=wide1 (label='diseases per id agegroup') prefix=disease;
  by id at_age;
  var disease;
  format at_age agegrp. ;
run;

* computed values for tabulation;
data wide2(keep=at_age disease_count disease_list);
  set wide1;
  disease_count = 5 - cmiss(of disease1-disease5);
  length disease_list $100;
  disease_list = coalescec (catx(', ', of disease1-disease5), '* NONE *');
run;

ods html file='tabulation.html' style=plateau;
title;

proc tabulate data=wide2;
  class disease_count disease_list at_age;
  table 
    disease_count*disease_list
    ,
    at_age * (n*f=comma9. colpctn)
    /
    nocellmerge
  ;
run;
ods html close;

HTML output image

enter image description here

1
votes

Consider creating indicator columns by concatenating all flags together where 0_0_0_0_0 indicates no diseases and 1_1_1_1_1 indicates all five diseases and in between include all possible combinations. Then join different aggregation levels for numerator and denominator of percentages, all while creating conditional columns.

data

Add indicator columns

data have;
    set have;
    select;
        when (age >= 30 & age <= 39) age_group='30-39';
        when (age >= 40 & age <= 49) age_group='40-49';
        when (age >= 50 & age <= 59) age_group='50-59';
        when (age >= 60 & age <= 69) age_group='60-69';
        when (age >= 70 & age <= 79) age_group='70-79';
        otherwise age_group='';
    end;

    length = catx('_', flag1, flag2, flag3, flag4, flag5);
    format length $char10.;
run;

proc sql

Run different level aggregation with conditional columns

proc sql;
    create table final_table as 
    select len.length format=$char10.
           , SUM(case when len.age_group = '30-39' then len.length_count / age.age_count else . end) as Pct_30_39 format=percent10.2
           , SUM(case when len.age_group = '40-49' then len.length_count / age.age_count else . end) as Pct_40_49 format=percent10.2
           , SUM(case when len.age_group = '50-59' then len.length_count / age.age_count else . end) as Pct_50_59 format=percent10.2
           , SUM(case when len.age_group = '60-69' then len.length_count / age.age_count else . end) as Pct_60_69 format=percent10.2
           , SUM(case when len.age_group = '70-79' then len.length_count / age.age_count else . end) as Pct_70_79 format=percent10.2

    from 
       (select h.age_group
               , h.length
               , COUNT(*) as length_count
        from have h
        group by h.age_group
                 , h.length) as len
    inner join
       (select h.age_group
               , COUNT(*) as age_count
        from have h
        group by h.age_group) as age
      on len.age_group = age.age_group

     group by len.length;
quit;

Output

Total rows is 32 as confirmed by sum of combinations:

550 + 551 + 552 + 553 + 554 + 555 = 1 + 5 + 10 + 10 + 5 + 1 = 32

Table Output

To check, all columns equal 1

proc means data = final_table sum; run;

enter image description here