I have a SAS data set that is about 8 gigabytes. Is there an easy way to read this data set in more quickly in a data step? It currently takes about 2 hours for the data step to complete.
2 Answers
Specific times and performance will depend on your hardware. However, some tips:

- `options compress=yes;` will compress the dataset, potentially saving large amounts of space (depending on the data). `options compress=char;` is another option, appropriate when character data makes up the majority of the space used.
- Limit the number of times you read through the data. Write your program so that it needs fewer data passes. Consider using views, and when combining datasets, use techniques like formats or hash objects instead of sorting and joining.
- Use `PROC PRINT` to view the data rather than browsing the dataset, as you can customize the results more effectively.
- If you are on a server, consider the `SPDE` engine, which allows you to spread the data across multiple disks.
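As a sketch of the first and last tips, this might look like the following (the library names and disk paths are placeholders, not from the question):

```sas
/* Enable compression for datasets created from this point on */
options compress=yes;          /* or compress=char for mostly-character data */

/* SPDE library spread across multiple disks: metadata on one path,
   data partitions distributed over the DATAPATH= locations */
libname speedy spde '/meta/path' datapath=('/disk1/data' '/disk2/data');

/* Rewriting the dataset into the compressed SPDE library */
data speedy.bigdata;
    set mylib.bigdata;         /* mylib.bigdata is a placeholder name */
run;
```

Compression trades a little CPU for less I/O, which is usually a net win on a dataset this size.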
Joe's answer and the other comments so far answer the question directly - they do a good job of covering ways of increasing read speeds. However, I think it's also worth mentioning a few potential means of cutting down the number of records that actually need to be read, as this will also speed up each pass.
This is necessarily quite a speculative answer, but depending on what your code's doing, some of these might be worth investigating further.
Indexes
If you only want to process a relatively small proportion of records (< 20%) in each pass through your dataset, using a series of similar and reasonably simple where clauses in successive passes, you could consider creating indexes for some of the variables in the where clauses. The smaller the proportion of records you are interested in, the greater the benefit from using indexes, as they allow SAS to skip reading large sections of your dataset.
There is a one-off initial processing overhead when creating an index, and a further cost each time you alter the dataset. The index will also occupy some additional disk space.
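As an illustration (dataset, library, and variable names are placeholders):

```sas
/* One-off: create a simple index on a variable used in where clauses */
proc datasets library=mylib nolist;
    modify bigdata;
    index create state;
quit;

/* SAS can now use the index to satisfy this where clause,
   reading only the matching observations */
data subset;
    set mylib.bigdata;
    where state = 'CA';
run;
```

Setting `options msglevel=i;` beforehand makes the log report whether the index was actually used for a given where clause.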
Obs and firstobs
If you're only interested in observations in a particular position within your dataset (e.g. from observation 10000 to observation 20000 inclusive), you can skip straight to them and ignore all the others via the firstobs= and obs= dataset options (in this case, firstobs=10000 and obs=20000, since obs= gives the number of the last observation to read, not a count).
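For the example above, that would look like this (dataset name is a placeholder):

```sas
/* Read only observations 10000 through 20000 inclusive */
data slice;
    set mylib.bigdata (firstobs=10000 obs=20000);
run;
```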
Sortedby
If the dataset happens to be sorted by a variable used in a where clause, you can also use the sortedby= option on a set statement to speed up where clause processing.
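A minimal sketch, assuming the data is already sorted by a hypothetical `account_id` variable:

```sas
data subset;
    /* sortedby= asserts the existing sort order; SAS can then stop
       reading once the where range has been passed */
    set mylib.bigdata (sortedby=account_id);
    where account_id between 1000 and 2000;
run;
```

Note that sortedby= only asserts the order; SAS does not verify it, so the claim must actually be true for the results to be correct.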