I solved this by not using PROC IMPORT
. It isn't a solution for everyone, but it worked great for my purposes (i.e. not "big data"). If you're reading from an Excel spreadsheet, it should work for you.
ImportDataFile
is a macro1 which automates a data step import. A data step import requires a LENGTH
statement to define variable names and types, an INPUT
statement to read raw data from an external file, and an INFILE
statement to specify which file.
data &dataset.;
&infileStatement.;
length &lengthStatement. ;
input (_all_) (:) ;
run;
The macro is composed of three primary steps:
- Establish a DDE link, if necessary (i.e. connect to Excel)
- Get data variables by reading in the header
- Read in the remaining data
Notice how each of those corresponds to the three lines in the data step. Everything in the macro is in support of that data step.
In my experience, it was best to import the data as fixed width characters and then convert to whatever type was needed in a separate step. Yeah, it's redundant, but I never ran into issues with memory or space. The benefits far outweighed any hypothetical concerns. It made the data flow identical for each analysis which aided verification and saved time overall by avoiding the need to correct SAS's guess at the type (and inevitable silent truncation).
Because SAS is a horribly verbose language, this answer runs up against the StackOverflow answer character limit. A fully documented copy is here: https://pastebin.com/raw/RsXz3juJ Put the code in a file named something like ImportDataFile.sas
and make sure it runs (probably using %include
) before the macro is called. The call form is:
%ImportDataFile(
dirData=
, fileName=
, dataset=
, delimiter=
, overOption=
, headerRow=
, sheet=
, range=
, prefix=
, case=
, defLength=
);
where
Output(s) : SAS dataset, macro variable &listHeader
Inputs : dirData= Directory containing data file.
fileName= Filename including file extension. Must be
.csv, .txt, .tsv, .xls, or .xlsx.
dataset= Name of dataset output to WORK library.
delimiter= (optional) Delimiting string given in
quotes. Default for CSV is a comma, for
TXT/TSV a tab. This parameter may not be
set for Excel files. Doing so generates a
warning.
overOption= (optional) INFILE option. Default is
MISSOVER. Other choices are FLOWOVER,
STOPOVER, TRUNCOVER, or SCANOVER.
headerRow= (optional) Row corresponding to header in
an Excel file. Accepts R#C#:R#C#, but
should be given as R#. Default is R1.
sheet= Name of worksheet. Required for XLS or XLSX.
range= Range of spreadsheet to be imported.
Required for XLS and XLSX. Use form
R#C#:R#C#. See example below.
prefix= (optional) String to append to beginning of
each variable name. Default is no prefix.
case= (optional) Toggle mix case variable naming.
Must be lower/upper/mixed. Default is
lower.
defLength= (optional) Character field length. Default
value is 100.
For example, the following creates a dataset named xl_import
of character type with width 100 from a my_xl_file.xlsx
located in C:\Path\To\File
. The columns are prefixed with the string "raw_". The overOption
corresponds to those defined in the INFILE
statement.
%ImportDataFile(
dirData= C:\Path\To\File
, fileName= my_xl_file.xlsx
, dataset= xl_import
, prefix= raw_
, sheet= Sheet1
, range= R2C1:R13C18
, defLength= 100
, overOption= MISSOVER
);
Here is the code for the macro. Enjoy.
********************************************************************
** Utilities / Sub Macros
********************************************************************;
%macro ClearFileRef(fileRef);
filename &fileRef. clear;
%mend;
%macro CompareVariablesToDDERange();
%local columnIndex numberOfDDEColumns;
%let columnIndex = %eval(%sysfunc(findc(&range., 'C', ib)) + 1);
%let numberOfDDEColumns = %sysfunc(substr(&range., &columnIndex));
%if %ListLength(&listHeader) ^= &numberOfDDEColumns %then
%put WARNING: [MACRO] Data file contains %ListLength(&listHeader) variables. RANGE argument has &numberOfDDEColumns columns.;
%mend;
%macro EstablishSystemLink(fileRef);
filename &fileRef. dde 'excel|system';
%mend;
%macro EstablishWorkbookLink(fileRef, dirData, fileName, sheetName, range);
filename &fileRef. dde "excel|&dirData.\[&fileName.]&sheetName.!&range.";
%mend;
%macro IsEmpty(macroVariable);
%sysevalf(%superq(¯oVariable)=, boolean)
%mend;
%macro IsFileRef(reference);
%local fileRefExists externalFileExists returnValue;
%let fileRefExists = %sysfunc(fexist(&reference.));
%let externalFileExists = %sysfunc(fileexist(&reference.));
%if &fileRefExists. = 1 and &externalFileExists. = 0 %then %let returnValue = 1;
%else %let returnValue = 0;
&returnValue
%mend;
%macro IsFilePath(reference);
%local fileRefExists externalFileExists returnValue;
%let fileRefExists = %sysfunc(fexist(&reference.));
%let externalFileExists = %sysfunc(fileexist(&reference.));
%if &fileRefExists. = 0 and &externalFileExists. = 1 %then %let returnValue = 1;
%else %let returnValue = 0;
&returnValue
%mend;
%macro GetObsCount(dataset);
%local exists returnValue closed;
%let exists = %sysfunc(open(&dataset));
%if &exists. %then %do;
%let returnValue = %sysfunc(attrn(&exists, nobs));
%let closed = %sysfunc(close(&exists));
%end;
%else %do;
%put ERROR: [&SYSMACRONAME.] Dataset %upcase(&dataset) does not exist.;
%abort cancel;
%end;
&returnValue
%mend;
%macro GetVarCount(dataset);
%local exists varCount closed;
%let exists = %sysfunc(open(&dataset));
%if &exists. %then %do;
%let varCount = %sysfunc(attrn(&exists, nvars));
%let closed = %sysfunc(close(&exists));
%end;
%else %do;
%put ERROR: [&SYSMACRONAME.] Dataset %upcase(&dataset) does not exist.;
%abort cancel;
%end;
&varCount
%mend;
%macro ListLength(list);
%local count;
%if %sysevalf(%superq(list)=, boolean) %then %let count = 0;
%else %let count = %eval(%sysfunc(countc(&list., |)) + 1);
&count
%mend;
%macro ListElement(list, n);
%local nthElement;
%let nthElement = %sysfunc(scan(%superq(&list.), &n., |, m));
&nthElement
%mend;
%macro RemoveAllFormattingFromSheet(fileRef, sheet);
data _null_;
file &fileRef.;
/* Select sheet of interest */
put "[WORKBOOK.ACTIVATE(""&sheet."")]";
/* Select first cell */
put '[FORMULA.GOTO("R1C1")]';
/* Apply dummy filter of ">2" to first column */
put '[FILTER(1, ">2")]';
/* Disable filters */
put '[FILTER()]';
/* Select all */
put '[SELECT("R[0]C[0]:R[1048575]C[16383]", "R[0]C[0]")]';
/* Unhide rows */
put '[ROW.HEIGHT(,,TRUE, 2)]';
/* Unhide columns */
put '[COLUMN.WIDTH(,,TRUE, 2)]';
/* Remove all formatting */
put '[CLEAR(2)]';
/* Autofit column width */
put '[COLUMN.WIDTH(,,TRUE, 3)]';
run;
%mend;
%macro SetSystemOptions(opt1, opt2, opt3);
options &opt1. &opt2. &opt3.;
%mend;
%macro ImportDataFile(dirData=, fileName=, dataset=, delimiter=, overOption=MISSOVER, headerRow=R1, sheet=, range=, prefix=, case=lower, defLength=100) / minoperator mindelimiter=',';
%put NOTE: [MACRO] Executing: ImportDataFile(dirData=&dirData, fileName=&fileName, dataset=&dataset, delimiter=&delimiter, overOption=&overOption, headerRow=&headerRow, sheet=&sheet, range=&range, prefix=&prefix, case=&case, defLength=&defLength);
%local
macroStart
case
extension
HeaderRef
lengthStatement
delimiter
InfileRef
infileStatement
numberOfRecords
numberOfVars
duration
;
%global
listHeader
originalNOTES
originalQUOTELENMAX
;
%let macroStart = %sysfunc(datetime());
%let originalNOTES = %sysfunc(getoption(notes));
%let originalQUOTELENMAX = %sysfunc(getoption(noquotelenmax));
%SetSystemOptions(nonotes);
********************************************************************
** Validation
********************************************************************;
%if %IsEmpty(dirData) %then %do;
%put ERROR: [&SYSMACRONAME.] DIRDATA argument is blank.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if %IsEmpty(fileName) %then %do;
%put ERROR: [&SYSMACRONAME.] FILENAME argument is blank.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if %IsEmpty(dataset) %then %do;
%put ERROR: [&SYSMACRONAME.] DATASET argument is blank.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if not(%IsEmpty(prefix)) and not(%sysfunc(nvalid(&prefix, v7))) %then %do;
%put ERROR: [&SYSMACRONAME.] Invalid PREFIX="&prefix.";
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%let case = %upcase(&case.);
%if not(&case. in (LOWER, UPPER, MIXED)) %then %do;
%put ERROR: [&SYSMACRONAME.] Invalid case option: &case. Must be LOWER, UPPER, MIX.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%let extension = %upcase(%scan(&fileName., 1, '.', b));
%if not(&extension. in (TXT, TSV, CSV, XLS, XLSX)) %then %do;
%put ERROR: [&SYSMACRONAME.] Invalid file type: &extension. Must be TXT, TSV, CSV, XLS, XLSX.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if &extension. in (XLS, XLSX) and %IsEmpty(sheet) %then %do;
%put ERROR: [&SYSMACRONAME.] SHEET argument undefined.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if &extension. in (XLS, XLSX) and %IsEmpty(range) %then %do;
%put ERROR: [&SYSMACRONAME.] RANGE argument undefined.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if not(&extension. in (XLS, XLSX)) and not(%IsEmpty(sheet)) %then %do;
%put ERROR: [&SYSMACRONAME.] SHEET argument only valid for XLS or XLSX files.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if not(&extension. in (XLS, XLSX)) and not(%IsEmpty(range)) %then %do;
%put ERROR: [&SYSMACRONAME.] RANGE argument only valid for XLS or XLSX files.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
**********************************
*** Define delimiter
**********************************;
%if %IsEmpty(delimiter) %then %do;
%if &extension. in (XLS, XLSX) %then %let delimiter = '09'x;
%else %if &extension. = CSV %then %let delimiter = ',';
%else %if &extension. in (TXT, TSV) %then %let delimiter = '09'x;
%else %do;
%put ERROR: [&SYSMACRONAME.] Delimiter error.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%end;
%if &extension. in (XLS, XLSX) and &delimiter ^= '09'x %then %do;
%put WARNING: [&SYSMACRONAME.] Delimiter for Excel file must be '09'x.;
%put WARNING: [&SYSMACRONAME.] Delimiter set to '09'x.;
%let delimiter = '09'x;
%end;
********************************************************************
** Prep Excel Worksheet
********************************************************************;
%if &extension. in (XLS, XLSX) %then %do;
%let DDECommandRef = DDEcmd;
%EstablishDDELink(fileRef=&DDECommandRef.);
%RemoveAllFormattingFromSheet(fileRef=&DDECommandRef., sheet=&sheet.);
%end;
********************************************************************
** Get header
********************************************************************;
**********************************
*** Define file reference
**********************************;
%if &extension. in (XLS, XLSX) %then %do;
%let HeaderRef = DDEHead;
%EstablishDDELink(
fileRef= &HeaderRef.
, dirData= &dirData.
, fileName= &fileName.
, sheetName= &sheet.
, range= &headerRow.
);
%end;
%else %if &extension. in (CSV, TXT, TSV) %then
%let HeaderRef = %sysfunc(dequote(&dirData.))\&fileName.;
%ReadHeaderIntoList(reference=&HeaderRef., delimiter=&delimiter., prefix=&prefix., case=&case.);
********************************************************************
** Create length statement
********************************************************************;
%let lengthStatement = %CreateLengthStatement(&listHeader., &defLength.);
********************************************************************
** Import data
********************************************************************;
**********************************
*** Define infile statement
**********************************;
%if &extension. in (XLS, XLSX) %then %do;
%let InfileRef = DDESheet;
%EstablishDDELink(
fileRef= &InfileRef.
, dirData= &dirData.
, fileName= &fileName.
, sheetName= &sheet.
, range= &range.
);
%let infileStatement = infile &InfileRef. dlmstr=&delimiter. dsd notab &overOption.;
%CompareVariablesToDDERange();
%end;
%else %if &extension. in (CSV, TXT, TSV) %then %do;
%let InfileRef = %sysfunc(dequote(&dirData.))\&fileName.;
%let infileStatement = infile "&InfileRef." dlmstr=&delimiter. dsd &overOption. firstobs = 2 end=last_record;
%end;
**********************************
*** Perform import
**********************************;
data &dataset.;
&infileStatement.;
length &lengthStatement. ;
input (_all_) (:) ;
run;
********************************************************************
** Housekeeping
********************************************************************;
%let numberOfRecords = %GetObsCount(&dataset.);
%let numberOfVars = %GetVarCount(&dataset.);
%SetSystemOptions(notes);
%put;
%put NOTE: [MACRO] The dataset WORK.%upcase(&dataset.) has &numberOfRecords. observations and &numberOfVars. variables.;
%put NOTE: [MACRO] IMPORTDATAFILE macro used (Total process time):;
%let duration = %sysfunc(putn(%sysevalf(%sysfunc(datetime()) - ¯oStart.), time12.3));
%if %sysfunc(minute("&duration."t)) > 0 %then %do;
%put NO%str(TE-) real time %substr(&duration., 3, 8);
%end;
%else %do;
%put NO%str(TE-) real time %substr(&duration., 6, 5) seconds;
%end;
%put;
%SetSystemOptions(&originalNotes., &originalQUOTELENMAX.);
%mend;
%macro EstablishDDELink(fileRef, dirData, fileName, sheetName, range);
%put NOTE: [&SYSMACRONAME] Executing: EstablishDDELink(fileRef=&fileRef, dirData=&dirData, fileName=&fileName, sheetName=&sheetName, range=&range);
%local dirData linkConnection stopTime closeReturnCode;
********************************************************************
** Validate arguments
********************************************************************;
%if %IsEmpty(fileRef) %then %do;
%put ERROR: [&SYSMACRONAME] fileRef is blank.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if %length(&fileRef.) > 8 %then %do;
%put ERROR: [&SYSMACRONAME] Fileref &fileRef exceeds 8 character limit.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%if not %IsEmpty(dirData) %then %let dirData = %sysfunc(dequote(&dirData.));
********************************************************************
** Assign fileref according to link type
********************************************************************;
%if %IsEmpty(dirData)
and %IsEmpty(fileName)
and %IsEmpty(sheetName)
and %IsEmpty(range) %then %EstablishSystemLink(&fileRef.);
%else %EstablishWorkbookLink(&fileRef., &dirData., &fileName., &sheetName., &range.);
********************************************************************
** Check that link has been established
********************************************************************;
%let linkConnection = %sysfunc(fopen(&fileRef, S));
%if not (&linkConnection. > 0) %then %do;
/*Run until either Excel opens (linkConnection > 0)
or until 10 seconds have passed.*/
%let stopTime = %sysevalf(%sysfunc(datetime()) + 10);
%do %until (&linkConnection. > 0);
%if (%sysfunc(datetime()) >= &stopTime.) %then %do;
%put ERROR: [&SYSMACRONAME] DDE system link was not established. Operation timed out.;
%ClearFileRef(fileRef.);
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
%let linkConnection = %sysfunc(fopen(&fileRef, S));
%end;
%end;
********************************************************************
** Housekeeping
********************************************************************;
%let closeReturnCode = %sysfunc(fclose(&linkConnection));
%mend;
%macro ReadHeaderIntoList(reference, delimiter, prefix, case) / minoperator mindelimiter=',';
%put NOTE: [MACRO] Executing: ReadHeaderIntoList(reference=&reference, delimiter=&delimiter, prefix=&prefix, case=&case);
%local fileSpecification notab delimiter;
%global listHeader;
%SetSystemOptions(nonotes);
%if %IsEmpty(reference) %then %do;
%put ERROR: [&SYSMACRONAME.] REFERENCE argument is blank.;
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
********************************************************************
** Determine infile statement options
********************************************************************;
/*SAS filerefs exist only for Excel files*/
%if %IsFileRef(&reference.) %then %do;
%let fileSpecification = &reference.;
%let notab = notab;
%end;
/*Absolute references only for CSV,TXT,TSV files*/
%else %if %IsFilePath(&reference.) %then %do;
%let fileSpecification = "&reference.";
%let notab = ;
%let extension = %upcase(%scan(&reference., 1, '.', b));
%end;
%else %do;
%put ERROR: [&SYSMACRONAME.] Invalid input REFERENCE: [&reference.];
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
%abort cancel;
%end;
********************************************************************
** Read in header
********************************************************************;
data _null_;
infile &fileSpecification. dlmstr = '```#@' ¬ab. obs = 1 lrecl = 32767 ;
length
raw_header_line $ 32767
raw_with_pipes $ 32767
;
input raw_header_line;
raw_with_pipes = tranwrd(raw_header_line, &delimiter., '|');
call symput('rawListHeader', strip(raw_with_pipes));
run;
********************************************************************
** Transform headers into valid variable names
********************************************************************;
%SetSystemOptions(noquotelenmax);
data _null_;
length
i 8
listLength 8
header_i $ 32767
temp_i $ 32767
listValid $ 32767
;
listLength = %ListLength(%superq(rawListHeader));
do i = 1 to listLength;
header_i = scan("%superq(rawListHeader)", i, '|', 'm');
**********************************
*** Apply prefix
**********************************;
if not missing(header_i) then prefixed_i = cats("&prefix.", header_i);
else prefixed_i = header_i;
**********************************
*** Apply case
**********************************;
if "&case." = "LOWER" then cased_i = lowcase(prefixed_i);
else if "&case." = "UPPER" then cased_i = upcase(prefixed_i);
else cased_i = prefixed_i;
**********************************
*** Keep valid otherwise correct
**********************************;
if nvalid(cased_i, 'v7') then do;
if i = 1 then listValid = cased_i;
else listValid = catx('|', listValid, cased_i);
end;
else do;
**********************************
*** Fill in blank headers
**********************************;
if missing(cased_i) and "&case." = "UPPER" then temp_i = "%upcase(&prefix.)NO_HEADER";
else if missing(cased_i) then temp_i = "&prefix.no_header";
**********************************
*** Replace blanks with _ and
*** Remove invalid characters
**********************************;
else do;
replaced_space_with_underscore = tranwrd(strip(cased_i), ' ', '_');
temp_i = compress(replaced_space_with_underscore, '_', 'kin');
end;
**********************************
*** Make first char _ if digit
**********************************;
if anydigit(temp_i) = 1 then temp_i = cats('_', temp_i);
**********************************
*** Trim length to 32
**********************************;
if length(temp_i) > 32 then temp_i = substr(temp_i, 1, 32);
**********************************
*** Verify valid V7 name
**********************************;
if not nvalid(temp_i, 'v7') then do;
put 'ERROR: [&SYSMACRONAME.] Error cleaning header ' i +(-1) '. Invalid SAS name.';
call execute('
%SetSystemOptions(&originalNOTES., &originalQUOTELENMAX.);
data _null_;
abort cancel nolist;
run;');
stop;
end;
if i = 1 then listValid = temp_i;
else listValid = catx('|', listValid, temp_i);
end;
output;
end;
call symput('listValid', strip(listValid));
run;
********************************************************************
** Append repeated headers with incremented value
********************************************************************;
/*Use hash table with key being each header and value
corresponding to the number of occurences. Create new
header list as follows: If first occurence of a header,
add to list. If not first occurence, ruthlessly append
occurence number (ensuring validity) and add to list.
Beware: SAS documentation for hashes contains syntax
errors.*/
data _null_;
length
element_i $ 32
item $ 32
occurrences 8
new_list $ 32767
;
declare hash h();
h.defineKey('item');
h.defineData('item', 'occurrences');
h.defineDone();
call missing(item, occurrences);
listLength = input("%ListLength(&listValid.)", 8.);
do i = 1 to listLength;
element_i = scan("&listValid.", i, '|');
if not (h.find(key: element_i) = 0) then do;
h.add(key: element_i, data: element_i, data: 1);
new_list = catx('|', new_list, element_i);
end;
else do;
occurrences + 1;
h.replace(key: element_i, data: element_i, data: occurrences);
len = length(element_i);
digits = ceil(log10(occurrences + 1));
if (len + digits) > 32 then
new_element = cats(substr(element_i, 1, len - digits), occurrences);
else new_element = cats(element_i, occurrences);
new_list = catx('|', new_list, new_element);
end;
end;
call symput('listHeader', strip(new_list));
run;
%mend;
%macro CreateLengthStatement(listHeader, defLength);
%local lengthStatement header_h;
%let lengthStatement=;
%do h = 1 %to %ListLength(&listHeader.);
%let header_h = %ListElement(listHeader, &h);
%if &h. = 1 %then %let lengthStatement = &header_h. $ &defLength. ;
%else %let lengthStatement = &lengthStatement. &header_h. $ &defLength. ;
%end;
%let lengthStatement = &lengthStatement;
&lengthStatement
%mend;
1 The solution uses macros extensively. In my experience, people advised me to avoid macros. I found it best, with all respect, to ignore that advice. SAS doesn't have functions, which makes developing abstractions hard. Macros allow you to imitate functions. A common fear of macros is debugging. Stick to the Single Responsibility Principle and you'll find they're not hard to debug at all. Document them with a %put
statement and you'll know who is being called and when. If you're not familiar with macros, they're really just text replacement. The code goes through a pre-processor and replaces macro code with text. That text, and the rest of your code, is then executed. The best resource for learning about macros is the manual.
SAS/ACCESS to PC FILES
licensed? I was under the impression thatDBMS=XLSX
required it as well. – Joeformat
statements (And, it turns out, other similar attribute statements) inproc import
directly! – Joedbms=xlxs
is now part of BASE. TheEXCEL
and the others still require PC Files. What the difference is, I have no idea. – DomPazzdbms=xlsx
doesn't work inproc export
. =( – Robert Penridge