
We have a large dataset in a SAS environment with 30 million records, and several small datasets with 100,000 records each.

We have to left join these smaller datasets with the large table, and each join takes around 30-40 minutes.

Running the job for 5-6 datasets individually takes a long time. If we merge all these small datasets into a single dataset and then do one left join, will it take less time than the individual joins? Also, space in WORK is limited, so we have to consider that as well.

How do you read those datasets? Do you have a SQL database there? Or do you make REST API calls? (evgeny.myasishchev)
@evgeny.myasishchev We use SAS Enterprise, so all the datasets are already in SAS; no API calls. (rns)
If you have a sufficient amount of RAM (the small tables are smaller than RAM), the perfect solution is a SAS hash table. (Robert Soszyński)
Put an index on the smaller data sets, use a data step, and then look up records by key (sketched below). (Gordon Linoff)
@GordonLinoff Should we merge all the tables and then create an index, or create an index on each individual table and then join them? And will that reduce the processing time? (rns)
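
A minimal sketch of the indexed lookup Gordon describes, with placeholder dataset and variable names:

proc datasets library=work nolist;
    modify SMALL_DATA_SET;
    index create YOUR_KEY_VARIABLE;   /* simple index named after the key variable */
quit;

data want;
    set BIG_DATA_SET;
    /* indexed lookup: _iorc_ is 0 when the key is found */
    set SMALL_DATA_SET key=YOUR_KEY_VARIABLE / unique;
    if _iorc_ ne 0 then do;
        _error_ = 0;                  /* suppress the "not found" log message */
        call missing(SMALL_VAR1);     /* clear unmatched lookup variables */
    end;
run;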

2 Answers


Try a hash join: load the small table into an in-memory hash object once, then stream the big table past it in a single data step.

data want;
    /* pull in the small table's variable definitions without reading a row */
    if 0 then set SMALL_DATA_SET;
    /* load the small table into memory once, on the first iteration */
    if _n_ = 1 then do;
        declare hash h(dataset: "SMALL_DATA_SET", multidata: 'y');
        h.defineKey("YOUR_KEY_VARIABLE");
        h.defineData(all: 'yes');
        h.defineDone();
    end;
    set BIG_DATA_SET;
    /* find() returns 0 on a match and fills the lookup variables;
       on a miss, clear them so the row still outputs (left-join behaviour) */
    if h.find(key: YOUR_KEY_VARIABLE) ne 0 then
        call missing(SMALL_VAR1, SMALL_VAR2); /* list the small table's non-key variables */
run;
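
Since each small table is only around 100,000 rows, it should fit comfortably in memory (as Robert Soszyński notes in the comments). You can also extend the pattern to look up several small tables in one pass over the 30-million-row table, so the big table is read only once and no intermediate results land in WORK. A minimal sketch, assuming two hypothetical small tables keyed on the same variable:

data want;
    if 0 then set SMALL_TABLE_1 SMALL_TABLE_2;    /* hypothetical names */
    if _n_ = 1 then do;
        declare hash h1(dataset: "SMALL_TABLE_1");
        h1.defineKey("YOUR_KEY_VARIABLE");
        h1.defineData(all: 'yes');
        h1.defineDone();
        declare hash h2(dataset: "SMALL_TABLE_2");
        h2.defineKey("YOUR_KEY_VARIABLE");
        h2.defineData(all: 'yes');
        h2.defineDone();
    end;
    set BIG_DATA_SET;
    /* one sequential read of the big table serves every lookup */
    if h1.find() ne 0 then call missing(T1_VAR);  /* table 1's non-key variables */
    if h2.find() ne 0 then call missing(T2_VAR);  /* table 2's non-key variables */
run;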

The best solution for this kind of scenario is the MODIFY statement, which updates the master dataset in place without recreating it from scratch. Always keep a backup of the dataset you are modifying, because if anything goes wrong during processing the dataset can be left in an unusable state.

This paper has loads of info, but there's plenty of documentation to be found.

http://www.lexjansen.com/pnwsug/2006/PN01CurtisMackModify.pdf
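
For example, the basic pattern looks roughly like this (a sketch only; it assumes MASTER has an index on YOUR_KEY_VARIABLE, and TRANSACTIONS is a placeholder name):

data MASTER;
    set TRANSACTIONS;                      /* one row per update */
    modify MASTER key=YOUR_KEY_VARIABLE;   /* indexed lookup into the master */
    if _iorc_ = 0 then do;
        /* assign updated values from transaction-only variables here, then: */
        replace;                           /* rewrite the matched record in place */
    end;
    else _error_ = 0;                      /* no match: clear the error flag */
run;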

Cheers