0
votes

I have a bunch of character variables which I need to sort out from a large dataset. The unwanted variables all have entries that are the same or are all missing (meaning I want to drop these from the dataset before processing the data further). The data sets are very large so this cannot be done manually, and I will be doing it a lot of times so I am trying to create a macro which will do just this. I have created a list macro variable with all character variables using the following code (The data for my part is different but I use the same sort of code):

data test;
     input Obs ID Age;
datalines;
1 2 3
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
run;

proc contents
data = test
noprint
out = test_info(keep=name);
run;

proc sql noprint;
select name into : testvarlist separated by ' ' from test_info;
quit;

My idea is then to just use a data step to drop this list of variables from the original dataset. Now, the problem is that I need to loop over each variable, and determine if the observations for that variable are all the same or not. My idea is to create a macro that loops over all variables, and for each variable counts the occurrences of the entries. Since the length of this table is equal to the number of unique entries I know that the variable should be dropped if the table is of length 1. My attempt so far is the following code:

%macro ListScanner (org_list);
%local i next_name name_list;
%let name_list = &org_list;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
    %let next_name = %scan(&name_list, &i);
    %put &next_name;    
        proc sql;
            create table char_occurrences as
            select &next_name, count(*) as numberofoccurrences
            from &name_list group by &next_name;    
            select count(*) as countrec from char_occurrences;
        quit;
    %if countrec = 1 %then %do;
        proc sql;
            delete &next_name from &org_list;
        quit;
    %end;
    %let i = %eval(&i + 1);
    %end;
%mend;

%ListScanner(org_list = &testvarlist);

Though I get syntax errors, and with my real data I get other kinds of problems with not being able to read the data correctly but I am taking one step at a time. I am thinking that I might overcomplicate things so if anyone has an easier solution or can see what might be wrong to I would be very grateful.

2
Don't understand what you tried to do, in your macro program, problem is &name_list in proc sql, &name_list is resolved as variables list, not a dataset.Shenglin Chen
You can always roll your own, but there are solutions out there. I suggest the searching the SAS blogs, or Google: "sas rick wicklin missing" . The fastest approach is via IML, the 'easiest' coding approach is PROC FREQ and using the NLEVELS variable as well. It also depends on what is a large file. Less than 10 million records should be fine with the PROC FREQ approach, though it may depend on your memory settings as well.Reeza
You mention character variables but your example is numeric variables. Are the variables character or numeric?Reeza

2 Answers

0
votes

As far as your specific initial question, this is fairly straightforward. Assuming &testvarlist is your macro variable containing the variables you are interested in, and creating some test data in have:

%let testvarlist=x y z;

data have;
  call streaminit(7);
  do id = 1 to 1e6;
    x = floor(rand('Uniform')*10);
    y = floor(rand('Uniform')*10);
    z = floor(rand('Uniform')*10);
    if x=0 and y=4 and z=7 then call missing(of x y z);
    output;
  end;
run;


data want fordel;
  set have;
  if min(of &testvarlist.) = max(of &testvarlist.)
    and (cmiss(of &testvarlist.)=0 or missing(min(of &testvarlist.)))
  then output fordel;
  else output want;
run;

This isn't particularly inefficient, but there are certainly better ways to do this, as referenced in comments.

0
votes

There are many ways to do this posted around. But let's just look at the issues you are having.

First for looping through your space delimited list of names it is easier to let the %do loop increment the index variable for you. Use the countw() function to find the upper bound.

%do i=1 %to %sysfunc(countw(&name_list,%str( )));
  %let next_name = %scan(&name_list,&i,%str( ));
  ...
%end;

Second where is your input dataset in your SQL code? Add another parameter to your macro definition. Where to you want to write the dataset without the empty columns? So perhaps another parameter.

%macro ListScanner (dsname , out, name_list);
%local i next_name sep drop_list ;

Third you can use a single query to count all of variables at once. Just use count( distinct xxxx ) instead of group by.

proc sql noprint;
create table counts as 
  select 
  %let sep=;
  %do i=1 %to %sysfunc(countw(&name_list,%str( )));
    %let next_name = %scan(&name_list,&i,%str( ));
    &sep. count(distinct &next_name) as &next_name
    %let sep=,;
  %end;
  from &dsname 
;
quit;

So this will get a dataset with one observation. You can use PROC TRANSPOSE to turn it into one observation per variable instead.

proc transpose data=counts out=counts_tall ;
  var _all_;
run;

Now you can just query that table to find the names of the columns with 0 non-missing values.

proc sql noprint ;
  select _name_ into :drop_list separated by ' '
    from counts_tall
    where col1=0
  ;
quit;

Now you can use the new DROP_LIST macro variable.

data &out ;
  set &dsname ;
  drop &drop_list;
run;

So now all that is left is to clean up after your self.

proc delete data=counts counts_tall ;
run;
%mend;