Collapsing a large dataset while conditionally preserving some missing values

Question

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.

In other words, I need to turn HAVE:

id  names
1
1
1   Matt, Lisa, Dan
1
2
2
2
3
3
3   Emily, Nate
3
4
4
4   Bob
5

into WANT:

id  names
1   Matt, Lisa, Dan
2
3   Emily, Nate
4   Bob
5

I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):

data WANT_pre;
set HAVE;
if names = " " then delete;
run;

data WANT;
merge KEY
      WANT_pre;
   by id;
run;

This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?

Do you ever have different values for names in different rows with the same id? If so, which names would you keep? — user667489

Reeza Reeza · Accepted Answer · 2019-02-13T18:44:09

You can use SQL here easily, MAX() applies to character variables within SQL.

proc sql; 
create table want as
select id, max(names) as names
from have
group by ID;
quit;

Another option is to use an UPDATE statement instead.

data want;
update have (obs=0) have;
by ID;
run;

Collapsing a large dataset while conditionally preserving some missing values

4 Answers