1
votes

I am importing a tab-delimited file in SAS, but there is a possible number of extra tab characters in some of the data lines. In order to verify whether all lines have the same number of tab delimiters, I want to count the number of tab character in each data line. How do I achieve that in SAS? For ordinary characters (for example ";" or "|", the code below works fine, but I don't know how to refer to the tab character, since '09'x does not work inside the count() function:

   DATA output_file;
      drop inrecord;
      INFILE "input_file"
          LRECL=1000
          dlm=';'
      MISSOVER
      DSD
      ;


   length inrecord $1000; 
    input @1 inrecord $1000.;
    dlm_count = count(inrecord,';');
    RUN;
3
Is the real problem that some of the individual values contain tabs? For it to be possible to parse the file then those values that contain the delimiter should be enclosed in quotes. Are the quotes missing in your file?Tom
Yes, my problem was the possibility of extra tabs in some of the string variables, and the absence of enclosing quotes in my dataAguinaldo

3 Answers

1
votes

Use countc().

(Note: you'll need to manually add tabs to the end of the datalines rows as they're converted to spaces on Stack Overflow)

data dr_seuss;
    length line $100.;
    infile datalines dlm=',';
    input line$;

    total_tabs = countc(line, '09'x);

    datalines;
one fish    
two fish        
red fish            
blue fish               
;
run;
1
votes

The Count and the CountC both should work. '09'x should work inside Count.

CountC with modifiers would be more flexible for character delimited data and code that might have to run on both ASCII and EBCDIC platforms.

Use the input; statement alone to read an entire line into the input buffer which is accessed through the automatic variable _infile_.

filename tabdata '....';
data _null_;
  infile tabdata lrecl=32000;

  input;

  tabcount = count(_infile_, '09'x);
  if tabcount ne lag(tabcount) and _n_ > 1 then put 'tabcount changed! ' _n_= tabcount=;
run;

Example of three tab counterings:

data _null_;
  s = "ABC" || "09"x || "DEF" || "09"x || "GHI" || "0909090909"x;

  tabcount = count(s, "09"x);    * count when tab specified as the substring;
  tabcountc = countc(s, "09"x);  * countc with single character, the ascii tab character;
  tabcountc2 = countc(s,,'h');   * countc with modifier h for counting horizontal tabs;

  put (tabcount:) (=/);
run;
----------------- LOG ------------------
tabcount=7
tabcountc=7
tabcountc2=7

Be careful, the Count target is the substring argument, and the CountC single character target(s) are in the charlist argument. If there are multiple delimiters, use CountC.

0
votes

Both COUNT() and COUNTC() will work.

ntabs1=count(_infile_,'09'x);
ntabs2=countc(_infile_,'09'x);

To have any hope of actually fixing the file you need one of two things.

First the extra tabs might have been escaped by being prefixed with a backslash. Just look at one of the bad lines and decide if that is what they did. You could use TRANWRD() function to change any backslash + tab into some other character, like pipe.

data want;
  infile "input_file" dsd dlm='09'x truncover ;
  input @;
  _infile_=tranwrd(_infile_,'\'||'09'x,'|');
  input .... ;
run;

Second if you are certain that there is only one column that could contain the extra tabs then just peel off the values before and after that column and whatever is left is the value of that column.

If neither of these things are true then you might need to manually edit the lines with extra tabs. You could fix it by changing the embedded tabs to some other character or by adding quotes around the values with embedded tabs. You could use the result of the COUNTC() function to tell which lines need editing. For example if there should be 11 fields (so 10 tabs) then a program like this will split the file into two parts.

 data _null_;
   infile "input_file";
   input;
   if countc(_infile_,'09'x)=10 then file "good_file";
   else file "bad_file";
   put _infile_;
run;