0
votes

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.

In other words, I need to turn HAVE:

id  names
1
1
1   Matt, Lisa, Dan
1
2
2
2
3
3
3   Emily, Nate
3
4
4
4   Bob
5

into WANT:

id  names
1   Matt, Lisa, Dan
2
3   Emily, Nate
4   Bob
5

I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):

data WANT_pre;
set HAVE;
if names = " " then delete;
run;

data WANT;
merge KEY
      WANT_pre;
   by id;
run;

This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?

4
Do you ever have different values for names in different rows with the same id? If so, which names would you keep? - user667489

4 Answers

2
votes

You can use SQL here easily, MAX() applies to character variables within SQL.

proc sql; 
create table want as
select id, max(names) as names
from have
group by ID;
quit;

Another option is to use an UPDATE statement instead.

data want;
update have (obs=0) have;
by ID;
run;
1
votes

This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:

data want;
do until(last.id);
  set have;
  by id;
  length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/ 
  t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
1
votes
proc summary data=have nway missing;
   class id;
   output out=want(drop=_:) idgroup(max(names) out(names)=);
   run;
1
votes

Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.

data want;
  update have(obs=0) have ;
  by id;
run;