1
votes

I have monthly customer datasets in a SAS library from Jan 2013 onwards, named CUST_JAN2013, CUST_FEB2013, ..., CUST_OCT2017. Each monthly dataset is large, about 2 million members, and has two columns (customer number and customer monthly expenses).

I have one input dataset, Cust_Expense, with customer number and month as columns. Cust_Expense has only 250,000 members, and I want to pull the expense data for each member from the SPECIFIC monthly SAS dataset by joining on customer number.

Cust_Expense
------------
Customer_Number  Month
111              FEB2014
987              APR2017
784              FEB2014
768              APR2017
.....
145              AUG2017
345              AUG2014

I have tried using call execute, but it loops through each of the 250,000 records of the input dataset (Cust_Expense) and joins each one with the corresponding monthly customer table, which takes far too much time. Is there a way to read the input table (Cust_Expense) by month, so that we read all customers for a specific month and then read the matching monthly table ONCE to pull all the records for that month, instead of looping 250,000 times?

3
What do you want the result to look like? - user2877959
@Joe, you are saying that asking (in the form of a question, not criticism) for a sample desired output, so that I can try to provide a solution that outputs the desired result, is indication of not understanding how the review process is supposed to work? - user2877959
@user2877959 No, that was in response to other comments that were before/after yours. They've been moderator deleted, it looks like. Your response is good (and your edit helpful). - Joe

3 Answers

1
votes

Depending on what you want the result to be, you can create one output per month by filtering cust_expense by month and joining with the corresponding monthly dataset:

%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;

proc sql;
%do i=1 %to %sysfunc(countw(&months));
  %let month=%scan(&months,&i,%str( ));
    create table want_&month. as
    select *
    from cust_expense(where=(month="&month.")) t1
    inner join cust_&month. t2
      on t1.customer_number=t2.customer_number
    ;
%end;
quit;
%mend;
%want;

Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.

%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;

proc sql;
  create table want as
  select *
  from cust_expense t1
  inner join (
              %do i=1 %to %sysfunc(countw(&months));
                %let month=%scan(&months,&i,%str( ));
                %if &i>1 %then union all;
                select *, "&month." as month
                from cust_&month
              %end;
             ) t2
    on t1.customer_number=t2.customer_number
     and t1.month=t2.month
  ;
quit;
%mend;
%want;

In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.

1
votes

Your first, best option is to get rid of these separate monthly tables and combine them into one large table keyed on ID and month, with an index on month to make lookups faster. Then you can simply join on that table and go on your way. Having many separate tables like this, where a data element determines which table a row lives in, is never a good idea.

If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
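As a rough sketch of the view idea, a DATA step view can stack the monthly tables and derive the month from the dataset name through the INDSNAME= option. Only two of the monthly datasets from the question are listed here, and the column name customer_number in the monthly tables is an assumption.

```sas
/* DATA step view: nothing is copied until the view is read. */
data all_months / view=all_months;
  length month $7;
  /* List every monthly dataset here; only two are shown. */
  set cust_feb2014 cust_apr2017 indsname=source;
  /* source holds e.g. WORK.CUST_FEB2014; keep the part after the last underscore. */
  month = scan(source, -1, '_');
run;

/* One join against the view; month now exists on both sides. */
proc sql;
  create table want as
  select t2.*
  from cust_expense t1
  inner join all_months t2
    on  t1.customer_number = t2.customer_number
    and t1.month = t2.month;
quit;
```

Whether this beats the per-month joins depends on how SAS processes the view, so it is worth timing on a couple of months first.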

Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.

data want;
  set jan feb mar apr ... ;
  where put(id,CUSTEXPF1.) = '1';
run;

That makes only one pass through the 250k table and one pass through the monthly tables, plus a very fast format lookup whose cost is effectively negligible in this data step, since the disk I/O will be slower.
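A minimal sketch of how the format could be built with CNTLIN. As an extension that is not part of the step above, the key here combines customer number and month, so a customer is only matched in the month requested. The column name customer_number in the monthly tables, the format name $custexpf, and the '|' separator are assumptions, and only two monthly datasets are listed.

```sas
/* One key per (customer, month); duplicate keys would make PROC FORMAT fail. */
proc sort data=cust_expense out=keys nodupkey;
  by customer_number month;
run;

/* Control dataset for PROC FORMAT: listed keys map to '1'. */
data cntlin;
  set keys end=last;
  retain fmtname 'custexpf' type 'C';
  length start $40 label $1 hlo $1;
  start = catx('|', customer_number, month);
  label = '1';
  hlo   = ' ';
  output;
  if last then do;
    /* HLO='O' is the "other" row: any key not listed maps to '0'. */
    start = ' ';
    label = '0';
    hlo   = 'O';
    output;
  end;
run;

proc format cntlin=cntlin;
run;

data want;
  length month $7;
  /* List every monthly dataset here; only two are shown. */
  set cust_feb2014 cust_apr2017 indsname=source;
  /* source holds e.g. WORK.CUST_FEB2014; keep the part after the last underscore. */
  month = scan(source, -1, '_');
  /* Subsetting IF rather than WHERE, because month is computed in this step. */
  if put(catx('|', customer_number, month), $custexpf.) = '1';
run;
```

The subsetting IF is used because a WHERE clause cannot reference a variable that is created inside the same data step.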

-1
votes

I guess you could split your data into one dataset per month, as in this example:

data test;
infile datalines dsd;
   input ID : $2. MONTH  $3. ;
   datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
; 
run;

data  JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;

This avoids looping through all of your IDs (250,000) one by one and relies on the data step's OUTPUT statement instead.

At the end you will get 12 datasets, each containing the IDs for that month.

In your case, where the month value looks like FEB2014, you can use the SUBSTR function, and the condition in the data step becomes:

...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...

Regards