1 vote

For our analysis we need to read raw data from CSV (XLS) files and convert it into a SAS dataset before the analysis can begin.

Now, the problem is that this raw data generally has two issues:

1. The ordering of the columns sometimes changes. If in an earlier period the columns came in the order A, then B, then C, they might arrive as B, then C, then A in a later period.
2. There are foreign elements in the data, such as "#", ".", or stray letters.

We currently have to clean the raw data before reading it into SAS, which takes a considerable amount of time. Is there any way to clean the data within SAS itself while reading it? If we can rectify the data with SAS code, it will save quite a lot of time.

Here's the example:

Period 1: I got the data in Data1.csv in the format below. In column B, which is numeric, I have "#" and "."; in column C, which is also numeric, I have "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C remain. The question is how to remove them. I could use an IF statement, but there are too many possible foreign elements (instead of "#", ".", or "g", I might get others such as "$", "h", etc.). Is there a way to write code that detects and removes foreign elements without my having to specify each one in an IF statement every time I import the raw data into SAS?

   A    B   C
Name1   1   5
Name2   2   6
Name3   3   4
Name4   #   g
Name5   5   3
Name6   .   6

Period 2: In this period I got DATA2.csv, shown below. When I use the INFILE statement, I specify that A should be read first under its specific name, then B, then C. In the second period B comes first in the file, so when SAS reads the data I end up with B's values where A's should be. I therefore have to check the variable ordering against the previous period's data every time and correct it before reading the data with the INFILE statement. Since the number of variables is very large, verifying the column ordering this way is very time consuming (and at times frustrating). Is there SAS code with which SAS will automatically read A, then B, then C, even though the file is not in that order?

B   A   C
1   Name1   5
2   Name2   6
3   Name3   4
#   Name4   g
5   Name5   3
.   Name6   6

I mainly use SAS for my analysis, but I can use R to clean the data and then read it into SAS for further analysis, so R code would also be helpful.

Thanks.


2 Answers

3 votes

In R you can increase the speed of file reading by specifying the class of each column. With the example provided (3 columns, the middle one being "character"), you might use this code:

 dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")

The "#" and "." values would become NA when encountered in the numeric columns. The code above also removes the default specification of the comment character, which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_ as well, you could use this code:

dat <- read.csv( filename, 
                 colClasses=c("numeric", "character", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )

By default the header=TRUE setting is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There is further documentation, with worked examples, on reading Excel data elsewhere; however, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would also be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can specify a "Date" class for that column and skip the process of turning character columns in the default Excel date formats (all of which are bad) into dates.
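For example, a rough sketch combining those points with read.table(), assuming the column order of the second file (B, A, C) and the same na.strings as above:

dat2 <- read.table("DATA2.csv",
                   header=TRUE, sep=",",   # read.table() needs header=TRUE spelled out
                   colClasses=c("numeric", "character", "numeric"),
                   comment.char="",
                   na.strings=c("NA", ".", "#"))

If one of the columns held ISO "yyyy-mm-dd" dates, you could put "Date" in the corresponding position of colClasses instead of converting a character column afterwards.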

2 votes

Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.

A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:

proc import datafile='c:\temp\somefile.csv'
     out=SASdata
     dbms=csv replace;
run;

Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:

data SASdata;
   infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.; /* A character variable with max length 50 */
   informat B yymmdd10.; /* A date presented like 2012-08-25 */
   informat C dollar12.; /* A number containing dollar sign, commas, or decimals */

   input A B C;  /* The order of the variables in the file */

   if B = . then B = today(); /* A possible data cleaning statement */
run;

Note that the INPUT statement specifies the order in which the variables appear in the file. The point is that the code you use must match the layout of each file you process.

These are just general comments. If you encounter problems, post back with a more specific question.

UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.

In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):

data temp;
   infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
   /* Define dataset variables */
   informat A $50.;
   informat B 12.;
   informat C 12.;
   /* Add a KEEP statement to keep only the variables you want */
   keep A B C;

   input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
   infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.;
   informat B 12.;
   informat C 12.;

   input B A C;
run;
proc append base=ALLDATA data=temp;
run;

Notice that the "data definition" part of each data step is the same; the only difference is the order of the variables listed in the INPUT statement. Also notice that because the variables B and C are defined as numeric, when those invalid characters (# and g) are read, the values are stored as missing values.
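One possible refinement (a sketch only; the rawB and rawC helper variables are hypothetical): by default SAS also writes "Invalid data" notes to the log for those rows. If you want to keep the log clean, you could read the two numeric fields as character and convert them with the ?? modifier of the INPUT function, which returns a missing value for anything it cannot convert and suppresses the notes:

data temp;
   infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
   length A $50 rawB rawC $20;
   input A rawB rawC;          /* read every field as text first */
   B = input(rawB, ?? 12.);    /* "#" or "." becomes a missing value, no log note */
   C = input(rawC, ?? 12.);    /* "g" becomes a missing value, no log note */
   drop rawB rawC;
run;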

In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
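As an illustration of that idea (a sketch only; the macro name %read_period and its parameters are hypothetical, and the variable definitions assume the same A, B, C layout as above), the template could be a small macro that takes the file path and the INPUT order as parameters:

%macro read_period(file=, varorder=);
   data temp;
      infile "&file" dsd firstobs=2 lrecl=32767 truncover;
      /* fixed data definition: same variables and informats every period */
      informat A $50.;
      informat B 12.;
      informat C 12.;
      keep A B C;
      /* only the column order changes from file to file */
      input &varorder;
   run;
   proc append base=ALLDATA data=temp;
   run;
%mend read_period;

%read_period(file=c:\temp\data1.csv, varorder=A B C);
%read_period(file=c:\temp\data2.csv, varorder=B A C);

Each new period would then require only one new macro call, with the column order observed in that file.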