
Since the length of a macro variable's value cannot exceed the maximum of 65,534 characters, I can't create a single macro variable for all of my observations. I would like to create a macro that iterates through my data set to generate several numeric lists that I can pass to a where clause in proc sql.

Instead of this:

proc sql noprint;
    select ID into :ParaList separated by ','
    from work.ID_LIST(firstobs=1 obs=5000);
quit;

proc sql noprint;
    select ID into :ParaList2 separated by ','
    from work.ID_LIST(firstobs=5001 obs=10000);
quit;

proc sql noprint;
    select ID into :ParaList3 separated by ','
    from work.ID_LIST(firstobs=10001 obs=15000);
quit;

proc sql noprint;
    select ID into :ParaList4 separated by ','
    from work.ID_LIST(firstobs=15001 obs=20000);
quit;

I'd like something like:

proc sql noprint;
    select ID into :List1-Last4 separated by ','
    from work.ID_LIST(firstobs=1 obs=&LASTOBS BY 5000);
quit;

I'd like to create a macro that loops through every 5000 observations or so until the last observation, so that I can pass the lists into a where clause such as where id in (&ParaList,&ParaList2,&ParaList3,&ParaList4). I know there are alternatives such as

id in (select id from work.table)

but in this case, it doesn't work. I am querying Hadoop through SAS and haven't had any success except passing macro variable lists.
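
Roughly, this is the shape of the macro I'm imagining (sketch only, I haven't gotten it working; the name make_paralists, the chunk= parameter, and the global ParaList1-ParaListn variables are all placeholders):

```
%macro make_paralists(ds=work.id_list, chunk=5000);
  %local nobs nlists i first last;

  /* count the observations so we know how many lists to build */
  proc sql noprint;
    select count(*) into :nobs trimmed from &ds;
  quit;

  %let nlists = %sysevalf(&nobs / &chunk, ceil);

  %do i = 1 %to &nlists;
    %let first = %eval((&i - 1) * &chunk + 1);
    %let last  = %eval(&i * &chunk);  /* obs= is the last obs NUMBER, not a count */

    /* make the variable visible outside the macro */
    %global ParaList&i;
    proc sql noprint;
      select ID into :ParaList&i separated by ','
      from &ds(firstobs=&first obs=&last);
    quit;
  %end;
%mend make_paralists;

%make_paralists(ds=work.id_list, chunk=5000)
```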

For a slightly different approach, take a look at the %ds2list macro in this answer: stackoverflow.com/questions/49899251/… . It will stream the list as output so it doesn't suffer from the limitations of a macro variable. – Robert Penridge
Regarding id in (select id from work.table): can you use the SASTRACE options to log what lower-level activity the SAS/ACCESS library engine is doing? – Richard

4 Answers

1 vote

Can you create temporary tables in Hadoop? If so, upload your ID_LIST data to a temp table and then use it in a pass-through query.

libname hdp      hadoop noprompt="… connection parameters …";
libname hdp_temp hadoop noprompt="… connection parameters …" dbmstemp=yes;

proc delete data=hdp_temp.id_list;run;
data hdp_temp.id_list;
  set work.my_id_list;
run;

* I'm not sure of the Hadoop-side object naming patterns and default schemas, 
* so this code shows dbo. as is the case in SQL Server;
* the SAS libname option dbmstemp=yes for SQL Server causes created tables to be
* referenced as dbo.##<tablename>;

proc sql;
  connect using hadoop;
  create table work.subset_of_big as
  select * from connection to Hadoop
  ( select * from dbo.my_big_remote_table
    where id in (select id from dbo.##id_list)
  );
quit;
0 votes

You could easily use a data step to generate the macro variables. You should also generate a macro variable that references all of the other macro variables.

%let n_per_list=5 ;
data _null_;
  length idlist $32000;
  length macrolist $1000 ;
  retain macrolist;
  * read the next &n_per_list ids and build one comma-separated list;
  do i=1 to &n_per_list until (eof);
    set id_list end=eof;
    idlist=catx(',',idlist,id);
  end;
  listno+1;
  call symputx(cats('paralist',listno),idlist);
  * maintain a master list that references each PARALISTn variable;
  macrolist=catx(',',macrolist,cats('&','paralist',listno));
  call symputx('paralist',macrolist);
run;

Which for a simple test of 20 values split into groups of 5 yields this result:

151  %put Paralist=%superq(ParaList);
Paralist=&paralist1,&paralist2,&paralist3,&paralist4
152  %put &=Paralist1;
PARALIST1=1,2,3,4,5
153  %put &=Paralist;
PARALIST=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
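
The combined PARALIST variable can then be expanded directly in the query (a sketch; hdp.big_table is a hypothetical remote table, and id is assumed numeric):

```
proc sql;
  create table work.subset as
  select *
  from hdp.big_table
  where id in (&paralist);  /* resolves to &paralist1,&paralist2,... and then to the values */
quit;
```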

Or you could just look at generating the code into a macro instead of using macro variables. That should not have any limits on how long a list you could generate. You could try to figure out how to open the dataset while inside a macro without generating any SAS code so that the result of the macro call is just the list of values. But it would be much easier to generate the source code for the macro definition to a file and then %include the file to define it.

filename code temp;
data _null_;
  set id_list end=eof;
  file code lrecl=80;
  if _n_=1 then put '%macro paralist;' ;
  else put ',' @ ;
  put id @ ;
  if eof then put / '%mend paralist;' ;
run;

Result for same trivial 20 value list.

163  %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file /.../#LN00043.
164 +%macro paralist;
165 +1 ,2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 ,10 ,11 ,12 ,13 ,14 ,15 ,16 ,17 ,18 ,19 ,20
166 +%mend paralist;
NOTE: %INCLUDE (level 1) ending.
167  %put %paralist;
1 ,2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 ,10 ,11 ,12 ,13 ,14 ,15 ,16 ,17 ,18 ,19 ,20
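
Once defined, the macro can be invoked in place of the variable lists, with no macro-variable length limit (a sketch; lib.big_table is a placeholder for your table):

```
proc sql;
  create table work.subset as
  select *
  from lib.big_table
  where id in (%paralist);
quit;
```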
0 votes

I like @Richard's solution; it is clean and more efficient. You can also try implicit pass-through, if your dataset is not extremely large.

 libname SASTAB "yoursaslocation";
 libname HTAB  your hadoop parameters;

 proc sql;
 create table HTAB.newtable as 
 select * from HTAB.yourtable
 where id in (select id from SASTAB.yoursastable);
 quit;
0 votes

You can do this with a data step do loop and CALL EXECUTE.

data _null_;
  /* nobs= is a SET statement option, not a data set option */
  set lib.id_list nobs=totobs;

  do i = 1 to totobs by 5000;

    /* build and push one proc sql step per 5000-id chunk;          */
    /* obs= is the last observation NUMBER, not a count, so use     */
    /* i+4999 rather than 5000                                      */
    call execute(cat(
        "proc sql;
         select ID into :paralist", i, " separated by ','
         from lib.id_list (firstobs=", i,
        " obs=", i + 4999, "); quit;"
        ));

    call execute(cats('%put |paralist', i, '|;'));

  end;

  stop;  /* one pass of the loop is enough */
run;

Read more about the CALL EXECUTE routine in the data step: REF1 REF2

The above loop generates these steps:

proc sql;
select ID into :paralist1 separated by ','
from lib.id_list (firstobs=1 obs=5000);
quit;

proc sql;
select ID into :paralist5001 separated by ','
from lib.id_list (firstobs=5001 obs=10000);
quit;

/* etc */
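
The generated lists can then be combined in the final query (a sketch; the macro variable names follow the firstobs values the loop uses, a 20,000-row id_list is assumed, and lib.big_table is a placeholder):

```
proc sql;
  create table work.subset as
  select *
  from lib.big_table
  where id in (&paralist1, &paralist5001, &paralist10001, &paralist15001);
quit;
```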