2
votes

I need an advice from guru of SAS :).
Suppose I have two big data sets. The first one is a huge data set (about 50-100Gb!), which contains phone numbers. The second one contains prefixes (20-40 thousands observations). I need to add the most appropriate prefix to the first table for each phone number.

For example, if I have a phone number +71230000 and prefixes

+7
+71230
+7123

The most appropriate prefix is +71230.

My idea. First, sort the prefix table. Then in data step, process the phone numbers table

data OutputTable;
    set PhoneNumbersTable end=_last;
    if _N_ = 1 then do;
        dsid = open('PrefixTable');
    end;
    /* for each observation in PhoneNumbersTable:
       1. Take the first digit of phone number (`+7`).
          Look it up in PrefixTable. Store a number of observation of
          this prefix (`n_obs`).
       2. Take the first TWO digits of the phone number (`+71`).
          Look it up in PrefixTable, starting with `n_obs + 1` observation.
          Stop when we will find this prefix
          (then store a number of observation of this prefix) or
          when the first digit will change (then previous one was the
          most appropriate prefix).
       etc....
    */
    if _last then do;
        rc = close(dsid);
    end;
run;

I hope my idea is clear enough, but if it's not, I'm sorry).

So what do you suggest? Thank you for your help.

P.S. Of course, phone numbers in the first table are not unique (may be repeated), and my algorithm, unfortunately, doesn't use it.

3
Since this question was addressed to SAS gurus, and also involved big data, I was wondering if one of you could help me answer this general question on SAS Data Storage Options and Big Data: datascience.stackexchange.com/questions/12619/…. I'm trying to compare SAS Data Storage to regular RDBMS like SQL Server. Any help on this is truly appreciated.Minu

3 Answers

2
votes

There are a couple of ways you could do this, you could use a format or a hash-table.

Example using format :

/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
  set prefixtable end=eof ;
  retain fmtname 'PREFIX' type 'C' maxlen . ;
  maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
  start = prefix ;
  label = 'Y' ;
  output ;
  if eof then do ;
    hlo = 'O' ;
    label = 'N' ;
    output ;

    call symputx('MAXPL',maxlen) ;
  end ;

  drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ; 

/* For each phone number, start with full number and reduce by 1 digit until prefix match found */
/* For efficiency, initially reduce phone number to length of max prefix */
data match_prefix ;
  set phonenumberstable ;

  length prefix $&MAXPL.. ;

  prefix = '' ;
  pnum = substr(phonenumber,1,&MAXPL) ;

  do until (not missing(prefix) or length(pnum) = 1) ;
    if put(pnum,$PREFIX.) = 'Y' then prefix = pnum ;
    pnum = substr(pnum,1,length(pnum)-1) ; /* Drop last digit */
  end ;
  drop pnum ;
run ;
2
votes

Here's another solution which works very well, speed-wise, as long as you can work under one major (maybe okay) restriction: phone numbers can't start with a 0, and have to be either numeric or convertible to numeric (ie, that "+" needs to be unnecessary to look up).

What I'm doing is building an array of 1/null flags, one 1/null flag per possible prefix. Except this doesn't work with a leading 0: since '9512' and '09512' are the same number. This could be worked around - adding a '1' at the start (so if you have possible 6 digits of prefix, then everything is 1000000+prefix) for example would work - but it would require adjusting the below (and might have performance implications, though I think it wouldn't be all that bad). If "+" is also needed, that might need to be converted to a digit; here you could say anything with a "+" gets 2000000 added to the beginning, or something like that.

The nice thing is this only takes 6 queries (or so) of an array at most per row - quite a bit faster than any of the other search options (since temporary arrays are contiguous blocks of memory, it's just "go check 6 memory addresses that are pre-computed"). Hash and format will be a decent chunk slower, since they have to look up each one anew.

One major performance suggestion: Pay attention to which way your prefixes will likely fail to match. Checking 6 then 5 then 4 then ... might be faster, or checking 1 then 2 then 3 then ... might be faster. It all depends on the actual prefixes themselves, and the actual phone numbers. If most of your prefixes are "+11" and things like that, you almost certainly want to start from the left if that mans a number with "94" will be quickly found as not matching.

With that, the solution.

data prefix_match;
  if _n_=1 then do;
    array prefixes[1000000] _temporary_;
    do _i = 1 to nobs_prefix;
      set prefixes point=_i nobs=nobs_prefix;
      prefixes[prefix]=1;
    end;
    call missing(prefix);
  end;  
  set phone_numbers;
  do _j = 6 to 1 by -1;
    prefix = input(substr(phone_no,1,_j),6.);
    if prefix ne 0 and prefixes[prefix]=1 then leave;
    prefix=.;
  end;
  drop _:;
run;

Against a test set, which had 40k prefixes and 100m phone numbers (and no other variables), this ran in a bit over 1 minute on my (good) laptop, versus 6 and change with the format solution and 4 and change with the hash solution (modifying it to output all rows, since the other two solutions do). That seems about right to me performance-wise.

0
votes

Here's a hashtable example.

Generate some dummy data.

data phone_numbers(keep=phone)
     prefixes(keep=prefix);
     ;

  length phone $10 prefix $4;
  do i=1 to 10000000;
    phone = cats(int(ranuni(0) * 9999999999 + 1));
    len = int(ranuni(0) * 4 + 1);
    prefix = substr(phone,1,len);
    if input(phone,best.) ge 1000000000 then do;
      output;
    end;
  end;

run;

Assuming the longest prefix is 4 chars, try finding a match with the longest first, then continue until the shortest prefix has been tried. If a match is found, output the record and move on to the next observation.

data ht;
  attrib prefix length=$4;

  set phone_numbers;

  if _n_ eq 1 then do;
    declare hash ht(dataset:"prefixes");
    ht.defineKey('prefix');
    ht.defineDone();
  end;

  do len=4 to 1 by -1;
    prefix = substr(phone,1,len);
    if ht.find() eq 0 then do;
      output;
      leave;
    end;
  end;

  drop len;

run;

May need to add logic if a match isn't found to output the record and leave the prefix field blank? Not sure how you want to handle that scenario.