0
votes

I have a dataset of patient diagnoses with one diagnosis code per line, resulting in patient diagnoses on multiple lines. Each patient has a unique patientID. I also have age, race, gender, etc. data on these patients.

How do I indicate to SAS, when using PROC FREQ, LOGISTIC, UNIVARIATE, etc., that several lines belong to the same patient?

This is an example of what the data looks like:

patientID diagnosis age gender  lab
1         15.02     65    M      positive
1         250.2     65    M      positive
2         348.2     23    M      negative
2         282.1     23    M      negative
3                   50    F      positive

I was given data on every patient who has had a certain lab (regardless of positive result), as well as all of their diagnoses, which each appear on a different line (as a different observation to SAS). First, I will need to exclude every patient who has a negative result for the lab, which I plan on using an IF statement for. The lab determines if the patient has disease X. Some patients do not have any additional diseases, other than disease X, such as patient #3.
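A minimal sketch of that exclusion step, assuming the raw data set is named `have` (the name used in the answer below), that `lab` is a character variable, and that the values are stored exactly as `positive` and `negative` as in the sample. The output name `positive` is made up for illustration.

```sas
/* Assumption: input data set HAVE, character variable LAB */
data positive;
    set have;
    /* Subsetting IF: only rows where the condition is true are written out. */
    /* The comparison is case-sensitive, so 'Positive' would not match.      */
    if lab = 'positive';
run;
```

Because the lab result is repeated on every line for a patient in the sample, this drops all lines for a negative patient at once.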

Analyses I would like to perform:

  1. Calculate the frequency of each disease using PROC FREQ.
  2. Characterize the age and race relationships for each diagnosis using PROC FREQ chi square.
  3. PROC LOGISTIC to determine risk factors (age, race, gender, etc.) for developing an additional disease on top of disease X.

Thanks!

1
Depends. In some cases it could be considered a repeat measurement and sometimes not. Mostly though you need to factor that in yourself. Your question references several procs so right now it's broad and we can't offer a single answer. If you narrow the question, we can provide some examples on how to deal with data like this. - Reeza
@Reeza, is it possible to combine the diagnoses for each patient on the same line in the data step? I am not treating these as multiple measurements, as I'm doing a purely cross-sectional analysis. - ybao
It is possible to transpose the dataset so that it would have one record per patient, and variables Diagnosis1 Diagnosis2 ... DiagnosisN. But usually the current structure is easier to work with. As Reeza said, it will be easier for people to help you if you can describe a specific analysis you would like to perform. - Quentin
@Quentin I have added the analyses I would like to perform. - ybao
I've worked with clinical data for years. Use the long format, it'll make your life easier in the long run. The DO loops that will be required otherwise will make your code clunkier and more difficult to maintain. Many beginners want it 'wide' because it's easier to understand, but experience will tell you it's a trap. - Reeza

1 Answer

2
votes

The short answer is that you cannot, at least not by default: SAS procedures treat each line as a separate observation. But you can account for it easily when you process the data. IMO keeping it long is easier.

You've asked too many questions above, so I'll answer just one: how to count the number of patients with each diagnosis.

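proc sort data=have out=unique_disease_patient nodupkey;
    by patientID diagnosis;   /* keep one row per patient per diagnosis */
run;

proc freq data=unique_disease_patient noprint;
    /* COUNT in the output data set is the number of patients per diagnosis */
    tables diagnosis / out=disease_patient_count;
run;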

Note that this is much easier in SQL:

 proc sql;
     create table want as
     select diagnosis, count(distinct patientID) as n_patients
     from have
     group by diagnosis;
 quit;

I'm assuming this is homework because you're unlikely to do this in practice except for exploratory analysis.
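For the logistic regression in the question, the model needs one row per patient. A rough sketch of one way to get there from the long data, assuming the data set is `have`, `lab` is character, and a missing `diagnosis` means no disease other than X (as for patient 3 in the sample). The names `patient_level` and `any_additional_dx` are made up for illustration, and race is left out only because it is not in the sample data.

```sas
/* Collapse the long data to one row per positive patient. */
proc sql;
    create table patient_level as
    select patientID,
           max(age)    as age,      /* constant within patient in the sample */
           max(gender) as gender,
           /* COUNT(column) ignores missing values, so this flag is 1 */
           /* when the patient has at least one recorded diagnosis.   */
           (count(diagnosis) > 0) as any_additional_dx
    from have
    where lab = 'positive'
    group by patientID;
quit;

/* Model the probability of having any additional diagnosis. */
proc logistic data=patient_level;
    class gender / param=ref;
    model any_additional_dx(event='1') = age gender;
run;
```

The data stay long for everything else. Only the modeling step works from the collapsed table, so each patient contributes exactly one observation.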