0
votes

self-taught SAS user here.

I often work with datasets that I have little control over and are shared among several different users.

I generally have been reading in files as CSVs using an infile statement + defining the variables with blocks of informat, format, and input statements. During this process, can I go ahead and rename variables--provided that everything is renamed in the correct order--or do they have to match the original dataset and be renamed in a later data step?

For example, the variable name in the dataset is '100% Fully Paid Out.' I know SAS variables can't start with numbers and I'd also like to simplify variable names in general, so could I do something like the following:

infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;

Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips highly appreciated--thanks!

EDIT

Maybe the question I should ask is this: is there any way of keeping the format of the csv for dollars and percentages (through proc import, which seems to convert these to characters)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer avoiding additional manual steps and also because I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the csv, so that variables are read in exactly how I want them to be read in?

Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer and/or I'm missing required drivers (or both). I was never able to do this even on a 32-bit computer either.

1
Thanks for replying so quickly, Tom. I'm not sure I fully understand but I'm going to amend my initial question.jh1021
I think Tom's answer still applies. Maybe if you could give example CSV (as text in a program box not a picture) just a few records will be enough and someone can show you the exact code you need to read the example and you can extrapolate that to your CSVs.data _null_
No, there isn't a faster way that would give you the same accuracy. Manual control is best. One thing you can do, if you deal with this often, is create a specification file format that you can create, in Excel or a CSV that specifies the formats/informats, types, lengths, names and then create a macro to use the specification file to import your data.Reeza

1 Answers

2
votes

CSV files do not contain any metadata about the variable types, as your note about trying to import them into Excel demonstrates. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer could vary from file to file based on the particular data values that happen to appear.

If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel and so does not have any conflicts between 32 and 64 installation. In fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have the same data type consistently in a single column of a single sheet.

You are better off writing your own data step to read the file. That way you can enforce consistency.

What I typically do when given a CSV file is copy the names from the first row and use it to create a LENGTH statement. This will both define the variables and set the order of the variables. You could at this point give the variables new names.

length paid $3 date amount 8 ;

Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.

informat date mmddyy. amount dollar. ;

Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach $CHAR. format in order to preserve leading spaces.

format date yymmdd10. ;

Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.

input paid -- amount ;