0
votes

I have approximately 27,000 .csv files on a SAS directory, these have also been copied onto a network drive so I have the option of using either.

I need to merge all the csv files into one data set and delete any empty rows (the empty rows contain a comma in column A).

Each CSV file has a unique name but the structure and formatting is the same.

In the final dataset… I’d like column A to contain the filename with data being copied from the import file range A1:down to S1:down into columns B to T.

I’ve tried with the below code, but it fails after around 1,000 files:

x 'cd C:\temp';

filename csv ('*.csv');

proc import out=work.LSPImportFiles
datafile = csv DBMS=CSV REPLACE;
GETNAMES=yes;
run;

I’ve also tried the code below, but this seems to miss out columns 8 and 9 and fails after few a hundred:

data want;
    length _filename_ $32;
    infile C:\temp\*.csv" dlm = ',' filename = _filename_;
    input @;
    if _filename_ ne lag1(_filename_) then delete;
input
Column_1    :$15.
Column_2    :$16.
Column_3    /*this is a number to 2 decimal places*/
Column_4    /*this is a number to 2 decimal places*/
Column_5    /*this is a number to 2 decimal places*/
Column_6    /*this is a number to 2 decimal places*/
Column_7    /*this is a number to 2 decimal places*/
Column_8    /*this is a percentage*/
Column_9    /*this is a percentage*/
Column_10   /*this is a number to 2 decimal places*/
Column_11   /*this is a number to 2 decimal places*/
Column_12   /*this is a number to 2 decimal places*/
Column_13   /*this is a number to 2 decimal places*/
Column_14   /*this is a blank column*/
Column_15   /*this is a number to 2 decimal places*/
Column_16   /*this is a number to 2 decimal places*/
Column_17   /*this is a number to 2 decimal places*/
run;
1
If columns 8 and 9 have percents signs then you should use PERCENT in-format or perhaps COMMA. I would also include INFILE statement options DSD and MISSOVER. It would be good to see the log notes from the data step, at least the ones that relate to data errors and going to a new line. I don't thing we need to see 27000 "INFILE IS" notes. I suspect you have flowover and other "issues".data _null_
What does "fail" mean?Joe

1 Answers

1
votes

You will need to define two variables for the name since the one referenced on the INFILE statement is automatically dropped. If you have a real CSV file then you will want to use the DSD option to handle empty fields properly, especially since you seem to indicate that at least one of columns will be empty. It is better to explicitly define the variables instead of making SAS guess based on what type format you assign or informat you use. Also it is good to use the TRUNCOVER option to avoid SAS jumping to the next line if a line has fewer fields than expected.

data want;
  length _filename_ filename $32 Column_1 $15 Column_2 $16  Column_3-Column_17 8;
  infile "C:\temp\*.csv" dsd dlm = ',' filename = _filename_ truncover;
  input @;
  if _filename_ ne lag1(_filename_) then delete;
  input Column_1 - Column_17 ;
  filename=_filename_;
run;

Depending on how the percentages are coded in the CSV file you might need to add this statement so that SAS will accept values like 10%.

  informat column_8 column_9 percent. ;

You can add any other processing you want to the data step. For example to remove lines where the first column is empty (which what I assume you meant by saying the column A contains a comma) you could add this line before the run statement.

 if missing(column_1) then delete;