4
votes

I have a data file that looks like this:

001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006

I am finding the task of reading this into SAS to be basically impossible. And no, in this case, reformatting the data file is out of the question. So let me explain what you are looking at here:

Each subject has two lines of data. The first line is-

subject number / clinic / wt / hr / dx / sx (don't worry about what the numbers mean, thats irrelevant).

The second line is text, which is basically a note containing extra information referring to the subject whose data is laid out in the previous line. So, the lines:

001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks

Are for a SINGLE subject. Subject 001. These need to become a single row in a SAS data set. I am completely at a loss; because of the different lengths for the clinic names, and the number columns not being aligned, I can't figure out how to get SAS to read this. This is the closest I have been able to get:

data ClinData;
    infile "&wdir.clinic_data.txt";
    retain patno clinic weight hr dx sx exinfo;
    input patno clinic $1. @;
    if clinic='M' then
        input patno @5 clinic $11. weight hr dx sx / @1 exinfo $30.;
    else if clinic='H' then
        input patno @5 clinic $3. weight hr dx sx / @1 exinfo $30.;
    run;

This prints as:

http://i61.tinypic.com/2uswl90.png

All of the numerical values are in the right place.

However, this has a several problems.

First, the subject number ('patno') always shows up as a missing value. Why?

Second, the clinic is only represented by its first letter 'M' or 'H'. I can't get SAS to change the length of the clinic variable based on which clinic it is.

Third, the variable "exinfo" contains the notes about the patient. However, I can't get SAS to include the entire line. The highest I can get it is around 30 characters before the formatting goes haywire.

Any help? The SAS documentation is frustratingly poor for this type of input. None of the examples really match up with what I need, and it doesn't adequately explain how to use some of the options. I know I need to use column/line pointers; but the problem is that the columns aren't consistent from line to line. So no matter which pointer format I use there will still be lines that don't come out right.

3

3 Answers

3
votes

Nothing in SAS is impossible. Looking at your sample data, I notice that there are two blanks after your clinic name and that your patient number is always three characters. If that is always true you can use that to your advantage:

data want;
  length patno $3 clinic $20 weight hr dx sx 8 exinfo $80;
  input;
  patno  = scan(_infile_,1,' ');
  clinic = substr(_infile_,5,index(_infile_,'  ')-5);
  weight = input(scan(_infile_,-4,' '),8.);
  hr     = input(scan(_infile_,-3,' '),8.);
  dx     = input(scan(_infile_,-2,' '),8.);
  sx     = input(scan(_infile_,-1,' '),8.);
  input exinfo $80.;

datalines;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
run;

Basically this is parsing the automatic variable _INFILE_ to read each variable. The "hard" part is fining how to read the clinic name (because it contains imbeded blanks). If the clinic does NOT always have that double-blank, you can still do it with other manipulations of the substr, index, and/or scan functions. I'll leave that to you if that's the case.

Also, when creating new datasets, always define your variables with a length statement to be sure they have the correct length, especially for character variables.

1
votes

You're mixing input types in odd ways that are not allowing you to read this properly.

Your clinic is 1 long because you're inputting it as one character, which defines it as 1. Don't do that - use a throwaway variable for that if needed - and define its length to something longer.

I recommend an approach like the below. It's easier to work with INFILE (the automatic variable created during input that contains one line of data) rather than to try to just use input techniques alone. Your data is pretty simple; if it's more complicated than you offer (such as you have more clinics than this), regular expressions or other logic may help this further - and infile will be easier to parse. There are also the ANYDIGIT and NODIGIT and similar functions, plus COMPRESS, which may help.

data want;
length clinic $12;
input 
@1 patid 3. @;  *hold input so _infile_ exists and we can play with it.  Might as well read in patid here.;
array numvars weight hr dx sx; *we are going to read this in via array;
do _t = 4 to 1 by -1;  *we are going through the string in backwards order;
 numvars[_t] = scan(_infile_,(_t-5),' '); *(_t-5) is giving us 4 -> -1 3 -> -2 etc.- I include space explicitly here as I think period otherwise might count which is bad;
end;
clinic = scan(_infile_,2); *start out using the 2nd word;
if scan(_infile_,3) = 'Clinic' then clinic=catx(' ',clinic,scan(_infile_,3)); *then maybe add the third word.  Here you could also check if compress(scan(_infile_,3),,'ka') is not missing;
input;
input @1 exinfo $50.;
put _all_;
datalines;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
;;;;
run;
1
votes

most of the issues you were running into were because of lengths that you had explicitly declared. For example Clinic was defined in the initial input statement as $1 and you can't modify the length after the fact as you attempted in the second input line .

this should get you closer to what you were looking for:

data ClinData(drop=s varlen);
  retain patno clinic weight hr dx sx; 

  input patno clinic $30. @;
    clinic=compress(clinic,,'ka');
    s=length(clinic)+4+2;
   input @s weight hr dx sx /@; 
     varlen=length(_infile_); 
    input  @1 exinfo $varying256. varlen;

datalines4;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
;;;;
run; 
proc print data=ClinData; run;