0
votes

I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.

One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:

there are several similarly structured source tables which I first apply a date filter to (to reduce the number of rows) and sort on two fields, say a and b, then I left join each table to a table with account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do it for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.

It occurs to me that I could first append, and then do just one join, but I have no notion of which approach is faster, or if the answer is that it depends I have no notion of what it depends on (though I'd guess the size of the constituent tables).

So, is it better to do many joins then append, or to append then do one join?

EDIT

Here is an update with some relevant information (requested by user Robert Penridge).

  • The number of source tables here is 7, and the size of these tables ranges from 1500 to 5.2 million. 10 000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5000 rows and 8 columns.
  • I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
  • I have limited the number of columns to the exact minimum amount required (21).
  • By default SAS DI creates all temporary datasets as views, and I have not changed that.
  • The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
  • The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
  • As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
  • I cannot say whether the widths of the fields is larger than absolutely necessary, but if so I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.

Much gratitude

2
How many joins? How many tables? What are the row counts in each table? How wide is each table? If you are considering indexes, what is the cardinality of the unique key in the table vs the number of rows in the table. Are you sure you need the sort step in there? It's not required for the join, so if you don't need the final dataset sorted in a particular way then you can drop that.Robert Penridge

2 Answers

0
votes

Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).

Without additional details it's difficult to help but some additional things you can consider are:

  • limit the columns you are processing by using a keep statement (reduces IO)
  • if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
  • if the joins are still time consuming, consider replacing them with hash table lookups
  • make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
  • consider not sorting the final dataset but placing an index on it for consumers of the data.
  • ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
  • reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)

Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.

0
votes

As per your comments it seems that
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results

In SAS DI studio, use a Lookup to perform the above much faster
1. Connect the 7 Input tables to a Lookup Transform (lets call them SRC 1-7) 2. The table with 5000 records is the tables on which lookup is performed on keys A and B (lets call this LKUP-1) 3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.

This will be much faster and you don't have to perform JOINs in this case as I suspect you are doing a Many-Many join which is degrading the performance in SAS DIS.