SAS Enterprise Guide / SQL Performance

Question

I'm looking for a little guidance on a SAS/SQL performance issue I'm having. In SAS Enterprise Guide, I've created a program that creates a table. This table has about 90k rows:

proc sql;
  CREATE TABLE test AS
  SELECT id, SUM(myField)
  FROM table1
  GROUP BY id;
quit;

I have a much larger table with millions of rows. Each row has an id. I want to sum values from this table, using only the ids present in the 'test' table. I tried this:

proc sql;
  CREATE TABLE test2 AS
  SELECT big.id, SUM(big.myOtherField)
  FROM big
  INNER JOIN test
    ON test.id = big.id
  GROUP BY big.id;
quit;

The problem is that the second query, against the big table with millions of records, takes forever to run. I thought the inner join on the subset of ids would help (and maybe it does), but I want to make sure I'm doing everything I can to speed it up.

I don't have any way to get information on the indexing of the underlying database. I'm more interested in getting the opinion of someone who has more SQL and SAS experience than me.

It's hard to know what the problem is here without a better definition of "millions of records" and "takes forever". Assuming no indexes, you'd expect the db to sort the big table on id and do a hash join with your new table "test", which shouldn't take that long. — antlersoft
How long does the inner join take, without the group by? IE, just create table test2 as select big.id from big inner join test on test.id=big.id; — Joe
Also, how are you connecting to this database, and what sort of database? You say you're doing this in EG; is this EG connected to a SAS server, or EG using a local SAS install? How close network-wise is the database server (on the same machine, in the same room, in the same building, across the ocean...) — Joe
Sorry for the vague information. I'm not sure how many rows are in the big table. There are 90k id's in the subset, and tens of millions at least in the overall pool. The EG is connected to a SAS server. I'm trying to run the inner join with no group-by now. I did some date-filtering that did reduce the time and at least return some results. Everything works except for the speed. It may just be that the table is too big! — Jeffrey Kramer
So there's no other DBMS involved (SQL/Oracle/etc.)? Just SAS datasets? — Joe

BellevueBob · Accepted Answer · 2013-07-09T18:38:12

From what you show in your question, you are joining two SAS data sets, not two database objects. In any case, you can speed up the processing by defining indexes on the JOIN columns used in each table. Assuming you have permission to do so, here are examples:

proc sql;
   create index id on big(id);
   create index id on test(id);
quit;
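
To check whether the new indexes are actually used, one option is to rerun the join with index and join-method diagnostics turned on. This is only a sketch: it assumes the table and column names from the question, and the alias myOtherTotal is a hypothetical name added for illustration. MSGLEVEL=I asks SAS to write index-usage notes to the log, and the _METHOD option on PROC SQL prints the join strategy the optimizer chose.

```sas
/* Write INFO notes about index usage to the log */
options msglevel=i;

/* _method prints the chosen join strategy to the log */
proc sql _method;
   create table test2 as
   select big.id,
          sum(big.myOtherField) as myOtherTotal  /* hypothetical alias */
   from big
   inner join test
      on test.id = big.id
   group by big.id;
quit;
```

If the log shows no index being used, the optimizer may have decided that a hash or merge join is cheaper for these table sizes, in which case the indexes will not help this particular query.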

Of course, you probably should first check the table definition before doing that. You can use the "describe" statement to see the structure:

proc sql;
   describe table big;
quit;
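
Another way to see which indexes already exist is to query the DICTIONARY.INDEXES table. The sketch below assumes both tables live in the WORK library, which is the default for one-level names; change the libname value if yours are stored elsewhere.

```sas
proc sql;
   /* Dictionary values are stored in upper case */
   select memname, name, indxname, unique
   from dictionary.indexes
   where libname = 'WORK'            /* assumption: tables are in WORK */
     and memname in ('BIG', 'TEST');
quit;
```

An empty result means neither table has an index defined yet.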

Indexes improve access performance at the cost of disk space and update maintenance. Once created, the indexes will be a permanent part of the SAS data set and will be automatically updated if you use SQL INSERT or DELETE statements. But be aware that the indexes will be deleted if you recreate the data set with a simple data step.
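
If the data set does have to be rebuilt with a data step, the index can be recreated in the same step with the INDEX= data set option. A minimal sketch, where big_new is a hypothetical output name:

```sas
/* Rebuild the data set and create a simple index on id in one pass. */
/* big_new is a hypothetical name used only for illustration.        */
data big_new(index=(id));
   set big;
run;
```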

On the other hand, if these tables really are in an external database (like Oracle, for example), you have a different challenge. If that's the case, I'd ask a new question and provide a complete example of the SAS code you are using (including any libname statements).
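
If you are not sure which case applies, listing the engine behind each assigned library is a quick check. This is a sketch, not a complete diagnosis: an engine such as V9 or BASE points to native SAS data sets, while an engine such as ORACLE or ODBC means the library is a connection to an external database.

```sas
proc sql;
   /* One row per assigned library path, with the engine that serves it */
   select libname, engine, path
   from dictionary.libnames;
quit;
```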
