1
votes

and sorry for my bad english

I have to optimize a script, which is written in this way

1) A proc sql to retrieve data

proc sql;
Create table table2 as 
Select A.Id
       ,A.UsefulData1
       ,B.UsefulData2
       ,...
       ,B.UsefulDataN

FROM Table1 as A

LEFT JOIN TableParam1 AS B
ON  A.KeyA = B.KeyA
AND A.KeyB = B.KeyB

/* here : several left join */

LEFT JOIN TableParam2 AS N
ON  A.KeyA = N.KeyA
AND A.KeyB = N.KeyB
;
quit;

2) Then a data step, to compute lot of data

data table3;
set table2;
/* here : lot of complex computes depending on useful data of the previous step */
keep id, computedData1, ..., computedDataN
run;
/* then add index on the id for later join */

/* if not debug, delete the proc sql table */

And then, it repeats the same logic of "proc sql / data step", where the new proc sql use the computed data of the previous data step. In the end I have severale couple of "proc SQL table / data step table".

It's very clear and easy to understand, but i find it not very efficient as the table resulting of the proc sql is only a "temporary table", only useful for the data step. So, i write a temporary table, which is quit costly in terme of time as they are big tables (> 10 millions lines)

In my mind, the most efficient thing would be to keep the "proc sql table" in memory, and then only ouput the "data step table". I would save a huge time of writing.

Is there a way to do such a thing? Or do you see another most efficient method ?

Thanks for your help!

2
Did you try making SQL query as a VIEW instead of TABLE?Tom
I think it's challenging to answer this question without more information. What are your subsequent data/proc steps doing? Can you use a BY statement and use the data in one pass? Does your data fit in RAM? Can you use a hash table or something else to avoid loading the data twice/thrice/etc.? Can you do all of the data splitting in a single pass? Too little information to go on to optimize here I think.Joe
@Tom In my experience it's unlikely the view would not be materialized on disk, unfortunately, when there are joins involved.Joe

2 Answers

0
votes

Yeah, first thought was the same as Tom's - change the first PROC SQL line to create view table2 as Then you at least don't use the storage of the intermediate table.

But if the SQL is only merging datasets it is entirely unnecessary because you can just do that in the data step. Replace set table2 with a merge statement and merge all your datasets right there.

0
votes

What is the nature of the 'complex computes' ? If they are row-wise only computations you can code them in the SQL.

The index can be built using a data set option (in either create table table-name, or data table-name)

proc sql;
  create table want (index=(id/unique)) as
  select …
  , … as complicated_result1
  … 
  , … as complicated_resultN
  from T1
  left join T2 on …
  left join T3 on …
  ...

or

data want (index=(id/unique));
  merge T1 T2 T3 …;
  by key1 key2 … keyN;
  complicated_result1 = … 
  … 
  complicated_resultN = … 
  …