0
votes

I have a sas data set consisting of more than 100 variables. Variables pid - a character variable and year - a numeric variable identify observations in the data set.

How can I create a new data set consisting of observations that have unique pid and year combination. That is, if a given pid and year combination occurs more than once, I want to delete all the associated observations, not just the duplicates.

4

4 Answers

3
votes

I don't use much of data step. I use proc sql and is easy for me.

proc sql;
    create table new_dataset as
    select * from old_dataset as a
      join
    (select pid, year, count(1) from old_dataset group by pid, year having count(1)<2)
    as b on a.pid=b.pid and a.year=b.year;
run;

inner query only gets pid and year which occur once. Any multiple occurrence of pid and year are not taken into account because of having count(1)<2. I get those observations only from original by joining back on pid and year. This actually doesn't need sorting.

Let me know in case of any questions.

2
votes

Simple use of first. and last. in a data step will do this. Run proc sort if the data is not already sorted by pid and year.

proc sort data=have;
by pid year;
run;

data want;
set have;
by pid year;
if first.year and last.year then output; /* only keep unique rows */
run;
2
votes

Use the UNIQUEOUT and NOUNIQUEKEY option in proc sort for a single step solution.

data class;
set sashelp.class;
run;

proc sort data=class nouniquekey uniqueout=unique_data;
by sex age;
run;

http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p0qh2iuz3fa6rpn1eib1gaxr0sb5.htm

1
votes

You can generate a dataset containing combinations of pid and year that appear more than once, then merge it with the rest to remove matches:

proc sort data = have nodupkey dupout = duplicates;
    by pid year;
run;

data want;
    merge have 
          duplicates(in = a keep = pid year);
    by pid year;
    if not(a);
run;