Merge huge data sets in SAS

Question

I need some advice from the SAS gurus :).
Suppose I have two data sets. The first is huge (about 50-100 GB!) and contains phone numbers. The second contains prefixes (20-40 thousand observations). For each phone number, I need to add the most appropriate prefix to the first table.

For example, if I have a phone number +71230000 and prefixes

+7
+71230
+7123

The most appropriate prefix is +71230, i.e. the longest prefix that matches the start of the number.

My idea: first, sort the prefix table. Then, in a data step, process the phone numbers table:

data OutputTable;
    set PhoneNumbersTable end=_last;
    retain dsid;  /* keep the data set handle across iterations so close() can use it */
    if _N_ = 1 then do;
        dsid = open('PrefixTable');
    end;
    /* for each observation in PhoneNumbersTable:
       1. Take the first digit of the phone number (`+7`).
          Look it up in PrefixTable. Store the observation number of
          this prefix (`n_obs`).
       2. Take the first TWO digits of the phone number (`+71`).
          Look it up in PrefixTable, starting from observation `n_obs + 1`.
          Stop when this prefix is found
          (then store its observation number) or
          when the first digit changes (then the previous match was the
          most appropriate prefix).
       etc....
    */
    if _last then do;
        rc = close(dsid);
    end;
run;

I hope my idea is clear enough; if it's not, I'm sorry.

So what do you suggest? Thank you for your help.

P.S. Of course, phone numbers in the first table are not unique (they may be repeated), and unfortunately my algorithm doesn't take advantage of that.

Since this question was addressed to SAS gurus, and also involved big data, I was wondering if one of you could help me answer this general question on SAS Data Storage Options and Big Data: datascience.stackexchange.com/questions/12619/…. I'm trying to compare SAS Data Storage to regular RDBMS like SQL Server. Any help on this is truly appreciated. — Minu

Chris J · Accepted Answer · 2015-09-15T08:59:17

There are a couple of ways you could do this: you could use a format or a hash table.

Example using a format:

/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
  set prefixtable end=eof ;
  retain fmtname 'PREFIX' type 'C' maxlen . ;
  maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
  start = prefix ;
  label = 'Y' ;
  output ;
  if eof then do ;
    hlo = 'O' ;
    label = 'N' ;
    output ;

    call symputx('MAXPL',maxlen) ;
  end ;

  drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ; 

/* For each phone number, start with full number and reduce by 1 digit until prefix match found */
/* For efficiency, initially reduce phone number to length of max prefix */
data match_prefix ;
  set phonenumberstable ;

  length prefix $&MAXPL.. ;

  prefix = '' ;
  pnum = substr(phonenumber,1,&MAXPL) ;

  do until (not missing(prefix) or length(pnum) = 1) ;
    if put(pnum,$PREFIX.) = 'Y' then prefix = pnum ;
    pnum = substr(pnum,1,length(pnum)-1) ; /* Drop last digit */
  end ;
  drop pnum ;
run ;
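
The hash-table option mentioned above could look roughly like the sketch below. It is untested and makes some assumptions: the data set and variable names (`prefixtable`, `phonenumberstable`, `prefix`, `phonenumber`) are the ones used in the format example, `phonenumberstable` does not already contain a variable called `prefix`, and the output name `match_prefix_hash` is made up for illustration. With only 20-40 thousand prefixes, the hash object should fit comfortably in memory.

```sas
/* Longest prefix length, so the search starts no longer than necessary */
proc sql noprint ;
  select max(length(prefix)) into :MAXPL
  from prefixtable ;
quit ;
%let MAXPL = &MAXPL ; /* strip the blanks PROC SQL leaves around the value */

data match_prefix_hash ;
  /* Give PREFIX the same attributes it has in prefixtable (never executed) */
  if 0 then set prefixtable(keep=prefix) ;

  /* Load all prefixes into memory once, on the first iteration */
  if _n_ = 1 then do ;
    declare hash h(dataset:'prefixtable(keep=prefix)') ;
    h.defineKey('prefix') ;
    h.defineDone() ;
  end ;

  set phonenumberstable ;

  /* Try the longest candidate first, then drop one character at a time */
  found = 0 ;
  do len = min(&MAXPL, length(phonenumber)) to 1 by -1 until (found) ;
    prefix = substr(phonenumber, 1, len) ;
    found = (h.check() = 0) ; /* check() returns 0 when the key exists */
  end ;
  if not found then call missing(prefix) ; /* no prefix matched */

  drop len found ;
run ;
```

As with the format version, the phone numbers table is read once and does not need to be sorted; each number costs at most `&MAXPL` in-memory lookups.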
