I need an advice from guru of SAS :).
Suppose I have two big data sets. The first one is a huge data set (about 50-100Gb!), which contains phone numbers. The second one contains prefixes (20-40 thousands observations).
I need to add the most appropriate prefix to the first table for each phone number.
For example, if I have a phone number +71230000
and prefixes
+7
+71230
+7123
The most appropriate prefix is +71230
.
My idea. First, sort the prefix table. Then in data step, process the phone numbers table
data OutputTable;
set PhoneNumbersTable end=_last;
if _N_ = 1 then do;
dsid = open('PrefixTable');
end;
/* for each observation in PhoneNumbersTable:
1. Take the first digit of phone number (`+7`).
Look it up in PrefixTable. Store a number of observation of
this prefix (`n_obs`).
2. Take the first TWO digits of the phone number (`+71`).
Look it up in PrefixTable, starting with `n_obs + 1` observation.
Stop when we will find this prefix
(then store a number of observation of this prefix) or
when the first digit will change (then previous one was the
most appropriate prefix).
etc....
*/
if _last then do;
rc = close(dsid);
end;
run;
I hope my idea is clear enough, but if it's not, I'm sorry).
So what do you suggest? Thank you for your help.
P.S. Of course, phone numbers in the first table are not unique (may be repeated), and my algorithm, unfortunately, doesn't use it.