0
votes

For a data that has the following variable: Student_ID, Start_week, Sy, Item, Type, School, Tuition and Country. If an observation has a same combination of the (Student_Id, Start_week, Sy, Item, Type and School), it is a duplicate observation.

For instance:

Student_ID Start_week Sy Item Type School
10001           1     11  101   0    2
10001           1     11  101   0    2

The two observation is a duplicate because it has same value for the combination. What I was doing was by:

proc freq data = mydataset;
 by Student_ID;
 tables Start_week Sy Item Type School;
run;

However, this didn't really help me to see what are duplicates and what are not. I wanted to create a count variable so to count duplicates, but it didn't capture the combination but just the Student_ID. Moreover, by using proc freq, it was running out of memory.

What are some effective ways to identify the duplicate values?

2

2 Answers

1
votes

PROC SORT has a relatively new option called NOUNIQUEKEY that will remove all unique records based on the key variables.

data h;
 input name $ age h;
 datalines;
 kir 1 1
 kir 1 2
 nir 1 3
    ;

 proc sort data=h nouniquekey out=duplicates;
 by  name age;
 run;
0
votes

one way to find duplicates is to sql or proc sort for all variables

data h;
input name $ age ;
datalines;
kir 1 
kir 1
nir 1
;

proc sql;
select * from h
group by name, age
having count(*) gt 1;

proc sort data=h nodup dupout = new;
by _all_;
run;

if it is fewer variables(not all variables) use nodupkey and below is code if you are not doing on all variables

 data h;
 input name $ age h;
 datalines;
 kir 1 1
 kir 1 2
 nir 1 3
    ;

 proc sql;
 select name, age from h
 group by name, age
 having count(*) gt 1;

 proc sort data=h nodupkey dupout = new(keep=name age);
 by  name age;
   run;