1 vote

For our analysis we need to read raw data from CSV (XLS) files and convert it into a SAS dataset before the analysis can begin.

Now, the problem is that this raw data generally has two issues:

1. The ordering of the columns sometimes changes. If in an earlier period the columns came in the order A, then B, then C, they might arrive as B, then C, then A in a later period.
2. There are foreign elements in the data, such as "#", ".", or stray letters.

We currently have to clean the raw data before reading it into SAS, which takes a considerable amount of time. Is there any way to clean the data within SAS itself while reading it? If we can rectify the data with SAS code, it will save quite a lot of time.

Here's the example:

Period 1: I got the data in Data1.csv in the format below. In column B, which is numeric, I have "#" and "."; in column C, which is also numeric, I have "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C remain. The question is how to remove them. I could use an IF statement, but there are too many possible foreign elements (instead of "#", ".", or "g", I might get others such as "$", "h", etc.). Is there a way to write code that detects and removes foreign elements without my having to specify each one in an IF statement every time I import the raw data into SAS?

   A    B   C
Name1   1   5
Name2   2   6
Name3   3   4
Name4   #   g
Name5   5   3
Name6   .   6

Period 2: In this period I got DATA2.csv, shown below. When I use the INFILE statement, I specify that A should be read first under its specific name, then B, then C. In the second period B comes first in the file, so when SAS reads the data I end up with B's values where A's should be. I therefore have to check the variable ordering against the previous period's data every time and correct it before reading the data with the INFILE statement. Since the number of variables is very large, verifying the column ordering this way is very time consuming (and at times frustrating). Is there SAS code with which SAS will automatically read A, then B, then C, even though the file is not in that order?

B   A   C
1   Name1   5
2   Name2   6
3   Name3   4
#   Name4   g
5   Name5   3
.   Name6   6

I mainly use SAS for my analysis, but I can use R to clean the data and then read it into SAS for further analysis, so R code would also be helpful.

Thanks.


2 Answers

3 votes

In R you can increase the speed of file reading by specifying the class of each column. With the example provided (3 columns, the middle one being "character"), you might use this code:

 dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")

The "#" and "." values would become NA when encountered in the numeric columns. The code above also removes the default specification of the comment character, which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_ as well, you could use this code:

dat <- read.csv( filename, 
                 colClasses=c("numeric", "character", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )

By default the header=TRUE setting is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There is further documentation, with worked examples, on reading Excel data elsewhere; however, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would also be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can specify a "Date" class for that column and skip the process of turning character columns in the default Excel date formats (all of which are bad) into dates.
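For example, a rough sketch combining those points with read.table(), assuming the column order of the second file (B, A, C) and the same na.strings as above:

dat2 <- read.table("DATA2.csv",
                   header=TRUE, sep=",",   # read.table() needs header=TRUE spelled out
                   colClasses=c("numeric", "character", "numeric"),
                   comment.char="",
                   na.strings=c("NA", ".", "#"))

If one of the columns held ISO "yyyy-mm-dd" dates, you could put "Date" in the corresponding position of colClasses instead of converting a character column afterwards.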

2 votes

Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.

A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:

proc import datafile='c:\temp\somefile.csv'
     out=SASdata
     dbms=csv replace;
run;

Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:

data SASdata;
   infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.; /* A character variable with max length 50 */
   informat B yymmdd10.; /* A date presented like 2012-08-25 */
   informat C dollar12.; /* A number containing dollar sign, commas, or decimals */

   input A B C;  /* The order of the variables in the file */

   if B = . then B = today(); /* A possible data cleaning statement */
run;

Note that the INPUT statement specifies the order in which the variables appear in the file. The point is that the code you use must match the layout of each file you process.

These are just general comments. If you encounter problems, post back with a more specific question.

UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.

In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):

data temp;
   infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
   /* Define dataset variables */
   informat A $50.;
   informat B 12.;
   informat C 12.;
   /* Add a KEEP statement to keep only the variables you want */
   keep A B C;

   input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
   infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.;
   informat B 12.;
   informat C 12.;

   input B A C;
run;
proc append base=ALLDATA data=temp;
run;

Notice that the "data definition" part of each data step is the same; the only difference is the order of the variables listed in the INPUT statement. Also notice that because the variables B and C are defined as numeric, when those invalid characters (# and g) are read, the values are stored as missing values.
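One possible refinement (a sketch only; the rawB and rawC helper variables are hypothetical): by default SAS also writes "Invalid data" notes to the log for those rows. If you want to keep the log clean, you could read the two numeric fields as character and convert them with the ?? modifier of the INPUT function, which returns a missing value for anything it cannot convert and suppresses the notes:

data temp;
   infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
   length A $50 rawB rawC $20;
   input A rawB rawC;          /* read every field as text first */
   B = input(rawB, ?? 12.);    /* "#" or "." becomes a missing value, no log note */
   C = input(rawC, ?? 12.);    /* "g" becomes a missing value, no log note */
   drop rawB rawC;
run;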

In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
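As an illustration of that idea (a sketch only; the macro name %read_period and its parameters are hypothetical, and the variable definitions assume the same A, B, C layout as above), the template could be a small macro that takes the file path and the INPUT order as parameters:

%macro read_period(file=, varorder=);
   data temp;
      infile "&file" dsd firstobs=2 lrecl=32767 truncover;
      /* fixed data definition: same variables and informats every period */
      informat A $50.;
      informat B 12.;
      informat C 12.;
      keep A B C;
      /* only the column order changes from file to file */
      input &varorder;
   run;
   proc append base=ALLDATA data=temp;
   run;
%mend read_period;

%read_period(file=c:\temp\data1.csv, varorder=A B C);
%read_period(file=c:\temp\data2.csv, varorder=B A C);

Each new period would then require only one new macro call, with the column order observed in that file.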