2
votes

DATA STRUCTURE: I have a data set that can be read as an Excel or CSV file. It has the following variable types: dates, time, numeric variables, and what should be numeric variables that incorrectly have characters attached to a number - e.g. -0.011* and 0.023954029324) (the parentheses at the end is in the cell) - due to an error in the program that wrote the file. There are also blank lines between every record, and it is not realistic to delete all of these as I have hundreds of files to manage.

DATA ISSUE: We've determined that some values are correct up to the character (i.e. -0.011 is correct as long as the asterisk is removed), while other values, such as 0.023954029324), are incorrect altogether and should be made missing. Please don't comment on this issue as it is out of my control and at this point all I can do is manage the data until the error is fixed and character values stop being written into the files.

PROBLEM WITH SAS:

1) If I use PROC IMPORT with an Excel file, SAS uses the first eight lines (20 for a CSV file) to determine if a variable is numeric or character. If the asterisk of parenthesis doesn't occur within the first 20 lines, SAS says the variable is numeric, then makes any later cells w/ character values missing. This is NOT okay in the case of the asterisks, because I want to maintain the numeric portion of the value and remove the asterisk in a later data step. Importing Excel files with PROC IMPORT does not allow the GUESSINGROWS option (as it does w/ CSV files, see below). Edit: Also, the MIXED=YES option does NOT work (see comments below - still need to change number of rows SAS uses, which, to me, means this option does...what?).

2) If I use PROC IMPORT with a CSV file, I can specify GUESSINGROWS=32767 and I get really excited because it then determines the variables with the asterisks are character and maintains the asterisks. However, it very strangely no longer determines the variables with parentheses as character (as it would have done when importing an Excel file as long as the parenthesis was in the first 20 lines), but instead removes the character and additionally rounds the value to the nearest whole number (0.1435980234 becomes 0, 1.82149023843 becomes 2, etc.). This is way too coarse of rounding - I need to maintain the decimal places. And, on top of that, the parentheses are now gone so I can't make the appropriate cells missing. I do not know if there is a way to make SAS not round and/or maintain the parentheses. To me, this is inconsistent behavior - why is the asterisk but not a parenthesis considered a character in this case? Also, when I read in the Excel file w/ PROC IMPORT (as described in (1)), it can cope w/ the parentheses (if they appear in the first 20 lines) - another inconsistency.

3) If I use INFILE, well - I get an error w/ every variable I try to read in - this procedure is way too sensitive and unstable for how varying the data are (and I have to code a work-around for the blank data lines).

ULTIMATE GOAL (note this code will be run automatically within a macro, if that matters):

1) Read date variable as a date

2) Read time variable as time

3) Be able to identify a variable w/ characters present in any cell of that variable (even after 20 lines) as a character variable and maintain the values in the cells (i.e. don't round/delete character). This can be by a priori telling SAS to let a certain set of variables be character (I will change them to numeric after I get rid of characters/make cells missing), or by SAS identifying variables w/ characters on its own.

1

1 Answers

1
votes
  1. SAS actually by default uses the first 8 rows. That is defined in a registry setting, TYPEGUESSROWS - which is normally stored in HKLM\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows\ (or insert-your-office-version-there). Change that value to FFFF (hex)/65536 (decimal) or some other large number, or zero to search the maximum number of rows (a bit over 16000 - exact number is hard to find).

  2. For a CSV file, you can write a data step import to control the formats of each variable. The easiest way to see this is to run the PROC IMPORT, then check your log; the log will contain the complete code used to read in the file in a data step. Then just modify the informats as needed. You say you have too much trouble with Infile method, so perhaps this won't work for you, but typically you can work around any inconsistencies - and if your files are THAT inconsistent it sounds like you'll be doing a ton of manual work anyway. This gives you the options to read in date/time variables correctly as well.

  3. You also can use PROC IMPORT/CSV to the log, writing the log out to a file, then read THAT in and generate new import code on your own - or even off a proc contents of the generated file, making known modifications.

Not sure what you're asking about with date/time as you don't mention issues with it in the first part of your question.

One additional option you have is to clean out the characters before it's read in (from the CSV). This is pretty simple, if it's truly just numerics and commas (and decimals and negative signs):

data mydata;
infile myfile /*options*/;
input @@;
length infileline $32767; *or your longest reasonable line;
infileline = compress(_infile_,'.-','kd');
run;

data _null_;
set mydata;
file myfile /*options*/ /*or a new file if you prefer */;
put @1 infileline $32767.; *or your longest reasonable line;
run;

Then read that new file in using proc import. I am splitting it into two datasteps so you can see it, but you could combine them into one for ease of running - look up "updating a file in place" in SAS documentation. You could also accomplish this cleaning using OS specific tools; on Unix for example a short awk script could easily remove the misbehaving characters.