3 votes

I have a large SAS data set (54 variables and over 10 million observations) that I need to load into Teradata. There are duplicates that must also come along, and my machine is not configured for MultiLoad. I simply want to create a table of the 300,000 duplicates so I can append them to the original load that did not accept them. The logic I've read in other posts seems fine for tables with just a few variables. Is there another way to create a new table in which every observation sharing the same combination of all 54 variables is listed? I'm trying to avoid the proc sort...by logic with 54 variables. The query builder method seemed inefficient as well. Thanks.


4 Answers

1 vote

Using proc sort is a good way to do it; you just need a nicer way to key off of it.

Create some test data.

data have;
  x = 1;
  y = 'a';
  output; 
  output;
  x = 2;
  output;
run;

Create a new field that is essentially equivalent to appending all of the fields in the row together and then running them through the md5() (hashing) algorithm. This gives you a nice short field that uniquely identifies the combination of all the values on that row.

data temp;
  length hash $16;   * md5() returns a 16-byte binary digest;
  set have;
  hash = md5(cats(of _all_));   * _all_ includes hash itself, but it is still blank at this point;
run;

Now use proc sort and our new hash field as the key. Output the duplicate records to the table named 'want':

proc sort data=temp nodupkey dupout=want;
  by hash;
run;
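
The hash field is only a helper key, so you will probably want to drop it before appending want to the Teradata load. A minimal sketch, continuing with the dataset names above:

data want;
  set want(drop=hash);   * discard the helper key before loading;
run;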

0 votes

You can do something like this:

proc sql;
  create table rem_dups as
    select <key_fields>, count(*) as dup_count
    from duplicates
    group by <key_fields>
    having count(*) > 1;
quit;

proc sql;
  create table target as
    select dp.*
    from duplicates dp
    left join rem_dups rd
      on <key_fields>               /* dp.key = rd.key for each key field */
    where <key_fields> is null;     /* rd-side keys are null, i.e. the key is not duplicated */
quit;
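
For concreteness, here is what that pattern might look like with two hypothetical key fields, id and visit_dt (names invented for illustration), keeping the table name duplicates from above:

proc sql;
  /* keys that occur more than once */
  create table rem_dups as
    select id, visit_dt, count(*) as dup_count
    from duplicates
    group by id, visit_dt
    having count(*) > 1;

  /* keep only the records whose key is not in rem_dups */
  create table target as
    select dp.*
    from duplicates dp
    left join rem_dups rd
      on dp.id = rd.id and dp.visit_dt = rd.visit_dt
    where rd.id is null;
quit;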

If there are more than 300K duplicates, this option does not work. Also, I'm afraid I don't know Teradata or the way you load tables there.

0 votes

First, a few sort-related suggestions, then the core 'fast' suggestion after the break.


If the table is entirely unsorted (ie, the duplicates can appear anywhere in the dataset), then proc sort is probably your simplest option. If you have a key that guarantees duplicate records will end up adjacent, then you can do:

proc sort data=have out=uniques noduprecs dupout=dups;
  by <key>;
run;

That will put the complete duplicate records in a secondary dataset (dups in the above); note noduprecs, not nodupkey - noduprecs requires all 54 variables to be identical before a record is treated as a duplicate. However, if the true duplicates are not physically adjacent within a key group (ie, you have 4 or 5 records with the same key but only two of them are complete duplicates, and they are separated by the others), it may not catch them; you would need a second sort, or you would need to list all variables in your by statement (which might be messy). You could also use Rob's md5 technique to simplify this some, as sketched below.
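
A minimal sketch of that combination, assuming the data sit in work.have: the md5 hash serves as the single sort key, so completely identical records end up adjacent and noduprecs can catch them without a 54-variable by statement.

data temp;
  length hash $16;
  set have;
  hash = md5(cats(of _all_));   * one short key for the whole record;
run;

proc sort data=temp out=uniques noduprecs dupout=dups;
  by hash;   * identical records share a hash, so they sort adjacent;
run;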

If the table is not 'sorted' but the duplicate records will be adjacent, you can use by with the notsorted option.

data uniques dups;
  set have;
  by <all 54 variables> notsorted;
  if not (first.<last variable in the list>) then output dups;
  else output uniques;
run;

That tells SAS not to complain if things aren't in sorted order, but still lets you use first./last. processing. It's not a great option, though, particularly as you need to spell out all 54 variables.


The fastest way to do this is probably to use a hash table, if you have enough RAM to handle it, or if you can break your table up in some fashion (without losing your duplicates). 10m rows times 54 variables of (say) 10 bytes each means about 5.4GB of data, so this only works if you have 5.4GB of RAM available to SAS to build a hash table with.

If you know that a subset of your 54 variables is sufficient for establishing uniqueness, then the unique hash only has to contain that subset of variables (ie, it might only be four or five key variables); see the sketch below. The dups hash table does have to contain all variables (since it will be used to output the duplicates).
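
For illustration only - a standalone sketch of how the unq declaration could shrink if two hypothetical variables, id and visit_dt, were enough to establish uniqueness (the full example below simply keys on every variable):

data _null_;
  length id 8 visit_dt 8;             * hypothetical identifier variables;
  declare hash unq();
  unq.defineKey('id', 'visit_dt');    * key on the subset only;
  unq.defineData('id', 'visit_dt');   * mirror the key, as in the full example below;
  unq.defineDone();
run;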

This works by using modify to process the dataset in place, without rewriting the majority of the observations; remove deletes the duplicate observations, and the hash object's output method writes the duplicates to a new dataset. The unq hash table is only used for lookup - so, again, it could contain just a subset of the variables.

I also use a technique here for getting the full variable list into a macro variable so you don't have to type 54 variables out.

data class;   *make some dummy data with a few true duplicates;
  set sashelp.class;
  if age=15 then output;
  output;
run;

proc sql;
  select quote(name) 
    into :namelist separated by ','
    from dictionary.columns
    where libname='WORK' and memname='CLASS'
  ;  * note: the libname/memname values in dictionary.columns are almost always uppercase;
quit;

data class;
  if 0 then set class;
  if _n_=1 then do;               *make a pair of hash tables;
     declare hash unq();
     unq.defineKey(&namelist.);
     unq.defineData(&namelist.);
     unq.defineDone();
     declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
     dup.defineKey(&namelist.);
     dup.defineData(&namelist.);
     dup.defineDone();
  end;
  modify class end=eof;
  rc_c = unq.check();           *check to see if it is in the unique hash;
  if rc_c ne 0 then unq.add();  *if it is not, add it;
  else do;                      *otherwise add it to the duplicate hash and mark to remove it;
    dup.add();
    delete_check=1;
  end;

  if eof then do;                      *if you are at the end, output the dups;
    rc_d = dup.output(dataset:'work.dups');
  end;

  if delete_check eq 1 then remove;        *actually remove it from unique dataset;
run;

-2 votes

Instead of trying to avoid proc sort, I would recommend using proc sort with an index.

Read the documentation about indexes.

I am sure there must be identifier(s) other than _n_ that distinguish observations, and with the help of an index, sorting with noduprecs or nodupkey and dupout=dataset would be an efficient choice (a sketch follows below). Furthermore, indexing could also facilitate other operations such as merging and reporting.
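
A minimal sketch of that idea, assuming the data sit in work.have and that two hypothetical variables, id and visit_dt, identify an observation:

proc datasets library=work nolist;
  modify have;
  index create key_idx = (id visit_dt);   * composite index on the identifier variables;
quit;

proc sort data=have out=uniques noduprecs dupout=dups;
  by id visit_dt;   * complete duplicates land in work.dups;
run;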

Anyway, I do not think a dataset with 10 million observations (each?) is a good dataset to work with, not to mention the 54 variables.