Suppose the data set have
contains various outliers which have been identified in an outliers
data set. These outliers need to be replaced with missing values, as demonstrated below.
Have
Obs group replicate height weight bp cholesterol
1 1 A 0.406 0.887 0.262 0.683
2 1 B 0.656 0.700 0.083 0.836
3 1 C 0.645 0.711 0.349 0.383
4 1 D 0.115 0.266 666.000 0.015
5 2 A 0.607 0.247 0.644 0.915
6 2 B 0.172 333.000 555.000 0.924
7 2 C 0.680 0.417 0.269 0.499
8 2 D 0.787 0.260 0.610 0.142
9 3 A 0.406 0.099 0.263 111.000
10 3 B 0.981 444.000 0.971 0.894
11 3 C 0.436 0.502 0.563 0.580
12 3 D 0.814 0.959 0.829 0.245
13 4 A 0.488 0.273 0.463 0.784
14 4 B 0.141 0.117 0.674 0.103
15 4 C 0.152 0.935 0.250 0.800
16 4 D 222.000 0.247 0.778 0.941
Want
Obs group replicate height weight bp cholesterol
1 1 A 0.4056 0.8870 0.2615 0.6827
2 1 B 0.6556 0.6995 0.0829 0.8356
3 1 C 0.6445 0.7110 0.3492 0.3826
4 1 D 0.1146 0.2655 . 0.0152
5 2 A 0.6072 0.2474 0.6444 0.9154
6 2 B 0.1720 . . 0.9241
7 2 C 0.6800 0.4166 0.2686 0.4992
8 2 D 0.7874 0.2595 0.6099 0.1418
9 3 A 0.4057 0.0988 0.2632 .
10 3 B 0.9805 . 0.9712 0.8937
11 3 C 0.4358 0.5023 0.5626 0.5799
12 3 D 0.8138 0.9588 0.8293 0.2448
13 4 A 0.4881 0.2731 0.4633 0.7839
14 4 B 0.1413 0.1166 0.6743 0.1032
15 4 C 0.1522 0.9351 0.2504 0.8003
16 4 D . 0.2465 0.7782 0.9412
The "get it done" approach is to manually enter each variable/value combination in a conditional which replaces with missing when true.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter $ 11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
EDIT: Updated outliers
so that parameter
avoids truncation and changed measurement
to be numeric type so as to match the corresponding height
, weight
, bp
, cholesterol
. This shouldn't change the responses.
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
This, however, doesn't utilize the outliers
data set. How can the replacement process be made automatic?
I immediately think of the IN=
operator, but that won't work. It's not the entire row which needs to be matched. Perhaps an SQL key matching approach would work? But to match the key, don't I need to use a where
statement? I'd then effectively be writing everything out manually again. I could probably create macro variables which contain the various if
or where
statements, but that seems excessive.