0
votes

I'm wondering about something that doesn't seem efficient to me.

I have 2 tables, one very large table DATA (millions of rows and hundreds of cols), with an id as primary key.

I then have another table, NEW_COL, with variable rows (1 to millions) but alwas 2 cols : id, and new_col_name.

I want to update the first table, adding the new_data to it.

Of course, i know how to do it with a proc sql/left join, or a data step/merge.

Yet, it seems inefficient, as far as I see with time executing, (which may be wrong), these 2 ways of doing rewrite the huge table completly, even when NEW_DATA is only 1 row (almost 1 min).

I tried doing 2 sql, with alter table add column then update, but it's waaaaaaaay too slow as update with joining doesn't seem efficient at all.

So, is there an efficient way to "add a column" to an existing table WITHOUT rewriting this huge table ?

Thanks!

3

3 Answers

2
votes

SAS datasets are row stores and not columnar stores like tables in other databases. As such, adding rows is far easier and efficient than adding columns. A key joined view could be argued as the most 'efficient' way to add a column to a data rectangle.

If you are adding columns so often that the 1 min resource incursion is a problem you may need to upgrade hardware with faster drives, less contentious operating environment, or more memory and SASFILE if the new columns are often yet temporary in nature.

2
votes

@Richard answer is perfect. If you are adding columns on regular basis then there is problem with your design. You either need to give more details on what you are doing and someone can suggest you.

I would try hash join. you can find code for simple hash join. This is efficient way of joining because in your case you have one large table and one small table if it fit into memory, it much better than a left join. I have done various joins using and query run times was considerably less( to order of 10)

By Altering table approach you are rewriting the table and also it causes lock on your table and nobody can use the table.

You should perform this joins when workload is less, which means during not during office and you may need to schedule the jobs in night, when more SAS resources are available

0
votes

Thanks for your answers guys.

To add information, i don't have any constraint about table locking, balance load or anything as it's a "projet tool" script I use.

The goal is, in data prep step 'starting point data generator', to recompute an already existing data, or add a new one (less often but still quite regularly). Thus, i just don't want to "lose" time to wait for the whole table to rewrite while i only need to update one data for specific rows.

When i monitor the servor, the computation of the data and the joining step are very fast. But when I want tu update only 1 row, i see the whole table rewriting. Seems a waste of ressource to me.

But it seems it's a mandatory step, so can't do much about it.

Too bad.