0
votes

I'm experiencing some very strange behaviour. I have a CSV data set with lots of columns (3199) and 148 rows/observations. Only some of the variables are of interest, but I've parsed out and manipulated what I need only to find a many of the variables have values that are all missing. Looking in the CSV directly (where I can) and looking at the data into R confirms the data are NOT missing.

My DATA step looks like this:

 DATA WORK.MYFILE;
  INFILE '\\[SERVER NAME]\[FOLDERS]\RawData.csv'
  lrecl=32760
  DSD
  TRUNCOVER
  FIRSTOBS=3  ;
INPUT
    F1               : $ CHAR6.
    ACAAAE31RR       : ?? BEST32.
    ACAAAE32RR       : ?? BEST32.
    ACAAAE33RR       : ?? BEST32.
    ACAAAE3BRR       : ?? BEST32.

 /* lots and lots of lines like this */

    SHAW5564TT3R     : ?? BEST32.
    SHAW6599TT3R     : ?? BEST32.
    SHAX0099TTAR     : ?? BEST32. ;
 RUN;

Everything seemed fine until I pulled out the data I wanted and realized this forced some missing variables. My guess is that after some column (I don't know which) there's only missing values all the way to the end of the observation row, and this maybe turns later columns to missing?? (Clearly I'm new to SAS).

BACKGROUND

In case it's relevant.

1) The data has many missing observations... most of them in the first several dozen rows. (The data are time series starting in 1980Q1 and the variables I want are missing before 2000Q1.)

2) I'm on SAS EnterpriseGuide. I got the above code by using the IMPORT wizard, copying the code into a program, and then removing the lines of formatting code above INFILE and changing the path to the actual CSV rather than the temporary one SAS creates with the wizard. When I did this, I did a find and replace of all the $CHAR1. and other misspecified informats and changed everything to BEST32.

3) My hunch about forced-to-missing-ness after some column is because all the variables are listed alphabetically. The variables I want look like:ACAINDIRR, AEDINDIRR, BVAINDIRR, BVIINDIRR, MWIINDIRR, OHAINDIRR, OOHINDIRR. The first three variables each have 104 non-missing observations as expected. The fourth (BVI...) has 43 non-missing observations and the rest have zero non-missing observations.

4) I've tried a bunch of trouble-shoot things to fixed this. The only thing that got close was converting the CSV to at *.txt file. The fourth variable, BVIINDIRR, then had 104 non-missing observations, oddly. But, the alphabetically later ones all still have 0 non-missing obs.

Any help is greatly appreciated!!

EDIT

Removing the ?? from the INPUT commands doesn't result in any errors or warnings. I do not get any warnings about characters being converted to numeric. The only notes in the Log are:

  NOTE: The infile  '\\[SERVER NAME]\[FOLDERS]\RawData.csv'is:
  Filename= \\[SERVER NAME]\[FOLDERS]\RawData.csv,
  RECFM=V,LRECL=32760,File Size (bytes)=23765791,
  Last Modified=03Aug2016:08:50:21,
  Create Time=03Aug2016:10:52:33

NOTE: 148 records were read from the infile  '\\[SERVER NAME]\[FOLDERS]\RawData.csv'.
  The minimum record length was 32294.
  The maximum record length was 32760.
  One or more lines were truncated.
NOTE: The data set WORK.PARETO has 148 observations and 31998 variables.
NOTE: MVA_DSIO.OPEN_CLOSE| _DISARM|         STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS| 
  _DISARM| | _DISARM| 148| _DISARM| 28418048| _DISARM| 10| _DISARM| 11| _DISARM| 66861615| _DISARM| 569154881| _DISARM| 
  0.296875| _DISARM| 0.672000| _DISARM| 1786020421.075000| _DISARM| 1786020421.747000| _DISARM| 0.281250| _DISARM| | _ENDDISARM 
NOTE: PROCEDURE| _DISARM|         STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS| _DISARM| | 
  _DISARM| 682061824| _DISARM| 26718208| _DISARM| 10| _DISARM| 11| _DISARM| 72237092| _DISARM| 569155157| _DISARM| 4.750000| 
  _DISARM| 7.937000| _DISARM| 1786020413.810000| _DISARM| 1786020421.747000| _DISARM| 2.453125| _DISARM| | _ENDDISARM 
 NOTE: DATA statement used (Total process time):
  real time           7.93 seconds
                        08:45 Friday, August 5, 2016

  cpu time            4.75 seconds
2
Remove the ?? from the code to have your errors generate so you have an idea of what's going wrong and where.Reeza
@Reeza - Didn't result in any errors or warnings. I editted the question to include the notes.Tony Beans
@TonyBeans you need to increase the LRECL so that you no longer see this message "One or more lines were truncated." You could set it very high like LRECL=1M. You will be able to determine the actual LRECL from the message "The maximum record length was ..." once you no longer see the truncation message.data _null_
@RobertPenridge Maximum lrecl is whatever your OS allows. 32767 is the maximum if you want to use _infile_ (as that is a character variable subject to character variable limits).Joe
@Joe Careful you're showing your age there ;-) Yeah I only started using SAS just before SAS 9 came out so I remember very few specifics about SAS 8 (other than it was horribly inferior to SAS 9).Robert Penridge

2 Answers

1
votes

Your LRECL is clearly insufficient. Assuming a non-UTF codepage file, your file is 23 million bytes but only 148 rows. That is an average of around 160000 characters per line. LRECL=32767 is not going to cut that.

Increase your LRECL (as suggested in comments, lrecl=1M is probably reasonable) and see if that solves your problem, or if you have other issues you need to solve with the specific character variables.

1
votes

Try reading the data from the missing columns in as character variables, using an informat such as $CHAR200 (something with a long length so you're less likely to be truncating it - with $CHAR1 everything after the first character won't show up). My guess is that all of your columns that are missing are character data, or at least have a character in them, while SAS is busy looking for something numeric - because you're telling it to when you specify best32.

Does the log happen to give you a note about character values being converted to numeric values? If not, what notes go along with the datastep you reference?