Automatically replace outlying values with missing values

Question

Suppose the data set have contains various outliers which have been identified in an outliers data set. These outliers need to be replaced with missing values, as demonstrated below.

Have

   Obs    group    replicate     height     weight         bp    cholesterol

     1      1          A          0.406      0.887      0.262        0.683
     2      1          B          0.656      0.700      0.083        0.836
     3      1          C          0.645      0.711      0.349        0.383
     4      1          D          0.115      0.266    666.000        0.015
     5      2          A          0.607      0.247      0.644        0.915
     6      2          B          0.172    333.000    555.000        0.924
     7      2          C          0.680      0.417      0.269        0.499
     8      2          D          0.787      0.260      0.610        0.142
     9      3          A          0.406      0.099      0.263      111.000
    10      3          B          0.981    444.000      0.971        0.894
    11      3          C          0.436      0.502      0.563        0.580
    12      3          D          0.814      0.959      0.829        0.245
    13      4          A          0.488      0.273      0.463        0.784
    14      4          B          0.141      0.117      0.674        0.103
    15      4          C          0.152      0.935      0.250        0.800
    16      4          D        222.000      0.247      0.778        0.941

Want

     Obs    group    replicate    height    weight      bp      cholesterol

       1      1          A        0.4056    0.8870    0.2615       0.6827
       2      1          B        0.6556    0.6995    0.0829       0.8356
       3      1          C        0.6445    0.7110    0.3492       0.3826
       4      1          D        0.1146    0.2655     .           0.0152
       5      2          A        0.6072    0.2474    0.6444       0.9154
       6      2          B        0.1720     .         .           0.9241
       7      2          C        0.6800    0.4166    0.2686       0.4992
       8      2          D        0.7874    0.2595    0.6099       0.1418
       9      3          A        0.4057    0.0988    0.2632        .
      10      3          B        0.9805     .        0.9712       0.8937
      11      3          C        0.4358    0.5023    0.5626       0.5799
      12      3          D        0.8138    0.9588    0.8293       0.2448
      13      4          A        0.4881    0.2731    0.4633       0.7839
      14      4          B        0.1413    0.1166    0.6743       0.1032
      15      4          C        0.1522    0.9351    0.2504       0.8003
      16      4          D         .        0.2465    0.7782       0.9412

The "get it done" approach is to manually enter each variable/value combination in a conditional that sets the value to missing when the condition is true.

data have;
  input group replicate $ height weight bp cholesterol;
  datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666    0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333    555    0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111   
3 B 0.9805 444    0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222    0.2465 0.7782 0.9412
;
run;

data outliers;
  input parameter $ 11. group replicate $ measurement;
  datalines;
cholesterol 3  A  111
height      4  D  222
weight      2  B  333
weight      3  B  444
bp          2  B  555
bp          1  D  666
  ;
run;

EDIT: Updated outliers so that parameter avoids truncation and changed measurement to be numeric type so as to match the corresponding height, weight, bp, cholesterol. This shouldn't change the responses.

data want;
  set have;

  if group = 3 and replicate = 'A' and cholesterol  = 111 then cholesterol = .;
  if group = 4 and replicate = 'D' and height       = 222 then height      = .;
  if group = 2 and replicate = 'B' and weight       = 333 then weight      = .;
  if group = 3 and replicate = 'B' and weight       = 444 then weight      = .;
  if group = 2 and replicate = 'B' and bp           = 555 then bp          = .;
  if group = 1 and replicate = 'D' and bp           = 666 then bp          = .;
run;

This, however, doesn't utilize the outliers data set. How can the replacement process be made automatic?

I immediately think of the IN= data set option, but that won't work: it's not the entire row that needs to be matched. Perhaps an SQL key-matching approach would work? But to match the key, don't I need a WHERE clause? I'd then effectively be writing everything out manually again. I could probably create macro variables that contain the various IF or WHERE statements, but that seems excessive.

In your example, anything over 100 would be an outlier. Is that your rule in general? It may be easier to load the levels into a temporary array and then use an array to loop and assign it to missing. — Reeza
The group/replicate can be considered a unique identifier. And, no, the outliers could be any value. I had used those dummy values for myself to facilitate creating fake data and then later realized that it introduced a pattern. However, I decided to leave them in as the technique used in a patterned case could be different from the general case, yet still be informative. If this goes against SO protocol, I'll gladly update the question. — Lorem Ipsum

John John · Accepted Answer · 2017-05-06T03:26:47

I don't think generating statements is excessive in this case. The complexity arises because your outliers dataset cannot be merged easily: the values of parameter are variable names in the have dataset. If it is possible to reorient the outliers dataset so you have a one-to-one merge, this logic would be simpler.
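For illustration, one way such a reorientation might look is sketched below. It is untested here and assumes at most one outlier per parameter within each group/replicate (otherwise the ID statement in proc transpose complains about duplicate values). The names have_sorted, outliers_sorted, outliers_wide, want_wide and the out_ prefix are made up for this sketch.

/* Sort both inputs by the merge key */
proc sort data=have out=have_sorted;
    by group replicate;
run;

proc sort data=outliers out=outliers_sorted;
    by group replicate;
run;

/* One row per group/replicate, one out_<parameter> column per flagged variable */
proc transpose data=outliers_sorted out=outliers_wide (drop=_name_) prefix=out_;
    by group replicate;
    id parameter;
    var measurement;
run;

data want_wide;
    merge have_sorted (in=in_have) outliers_wide;
    by group replicate;
    array params{*} height weight bp cholesterol;
    /* Same order as params; a column missing from outliers_wide is simply created as missing */
    array flagged{*} out_height out_weight out_bp out_cholesterol;
    do i = 1 to dim(params);
        if not missing(flagged{i}) and params{i} = flagged{i} then params{i} = .;
    end;
    if in_have;
    drop i out_:;
run;

The one-to-one merge is what keeps the last step short, at the cost of listing the measurement variables twice.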

Let's assume you cannot. There are a few ways to use a variable in a dataset that corresponds to a variable in another.

You could use an array like array params{*} height -- cholesterol; and then use the vname function as you loop through the array to compare each variable name to the value in the parameter variable. This gets complicated in your case because you have a one-to-many merge, so you would have to retain the replacements and output only the last record for each by group.
You could transpose the outliers data using proc transpose, but that gets lengthy because you need a transpose for each parameter, and then you need to merge all the transposed datasets back to the have dataset. My main issue with this method is that code with a bunch of transposes like that gets unwieldy.
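
The array idea in the first option could look roughly like this. It is an untested sketch: the names have_sorted, outliers_sorted and want_array are made up, and it relies on the merge keeping already-blanked values in place while the remaining outliers rows of the same by group are read.

proc sort data=have out=have_sorted;
    by group replicate;
run;

proc sort data=outliers out=outliers_sorted;
    by group replicate;
run;

data want_array;
    /* One have row can meet several outliers rows (e.g. group 2, replicate B) */
    merge have_sorted (in=in_have) outliers_sorted (in=in_out);
    by group replicate;
    array params{*} height -- cholesterol;
    if in_out then do i = 1 to dim(params);
        /* Match the array element's variable name against the parameter value */
        if upcase(vname(params{i})) = upcase(parameter)
           and params{i} = measurement then params{i} = .;
    end;
    /* Write a single row per group/replicate, after all its outliers were applied */
    if last.replicate and in_have then output;
    drop i parameter measurement;
run;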

Or you could create the macro variable logic you think might be excessive. Compared to the other ways of getting the values of the parameter variable to match up with the variable names in the have dataset, I don't think something like this is excessive:

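data _null_;
    set outliers;
    /* cats() strips the blanks that || leaves around the numeric _n_; */
    /* with those blanks the macro variable name would be invalid      */
    call symputx(cats("outlierstatement",_n_),"if group = "||group||" and replicate = '"||strip(replicate)||"' and "||parameter||" = "||measurement||" then "|| parameter ||" = .;");
    /* symputx stores the count without leading blanks */
    call symputx("outliercount",_n_);
run;

%macro makewant();
    data want;
        set have;
        /* emit one generated IF statement per row of outliers */
        %do i = 1 %to &outliercount;
            &&outlierstatement&i;
        %end;
    run;
%mend;

/* the macro only runs once it is called */
%makewant()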
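
To inspect what was generated, the standard macro facilities can help. This is a small sketch, not part of the solution itself: %put _user_ lists the user-defined macro variables (outliercount and the outlierstatement series should be among them), and with MPRINT switched on, any subsequent call of makewant writes the resolved data step statements to the log.

/* List all user-defined macro variables and their values in the log */
%put _user_;

/* Show the statements produced by macro execution in the log */
options mprint;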
