0
votes

I heard that SAS stores character variables in chunks of 8 bytes.

Therefore, the thinking goes, we should always set the length of character variables to a multiple of 8.

I have searched and could not find any support for the initial assertion.

Is it true? Is this covered somewhere in the documentation?

3
@Joe, the question is not about variable names, but variable lengths, as set by the length statement on a data set. - jaamor
Then your question makes even less sense; but you can verify this easily on your own, can you not? - Joe
True, I can verify empirically. I was hoping that someone who knows would share their insight on how SAS stores data. - jaamor
If you wanted to know the answer to the above, you should've just ... tested it. If you want to know something about how SAS stores data, ask that question. - Joe
I did my own empirical test and added it below as an answer. Now people can look this question up. - jaamor

3 Answers

2
votes

This answer applies to datasets that contain no 8-byte numeric variables. I will post separately for datasets that do.


No, there is nothing special about 8 byte character variable lengths.

See the example below:

data length8;
  length char0001-char9999 $8;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;
data length7;
  length char0001-char9999 $7;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;

data length4;
  length char0001-char9999 $4;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;

data length12;
  length char0001-char9999 $12;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;

data length16;
  length char0001-char9999 $16;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;

data length17;
  length char0001-char9999 $17;
  call missing(of _all_);
  do _i = 1 to 100; 
    output;
  end;
  drop _i;
run;

Each of these datasets is a different size, roughly proportional to the length of the character variables. Note that the length-4 dataset is proportionally a bit bigger (on my machine, anyway): in fact, lengths 4, 5, and 6 all produce the same size. This is because of the page size: the minimum page size on my installation is 64 KB (65536 bytes), and with lengths 4, 5, and 6 only one row of data fits in a page (the rows are roughly 40, 50, and 60 KB). So it's not because any particular size is reserved for a character variable; it's because of the total length of the data record.

That's where you could potentially save space with a small change: if your data happen to be arranged such that the page size is just under double the row size, then making the row slightly smaller will let two rows fit per page and save about half the space. That's unlikely to occur except in a very small number of cases, though: it requires a very wide row (many variables, or very long character variables). You can also alter the page size with options, which may be the better way to deal with edge cases like this.
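To check the page-size effect on your own installation, something like the following may help. It is a minimal sketch, assuming the length4 dataset above exists in WORK; the dataset name length4_big and the 128k value are purely illustrative. PROC CONTENTS reports the page size, the number of pages, and the maximum observations per page, and the BUFSIZE= data set option requests a page size for a newly created dataset.

```sas
/* Inspect page-level attributes of the wide dataset created above */
proc contents data=work.length4;
run;

/* Illustrative only: copy the data while requesting a larger page size, */
/* so that more than one ~40 KB row may fit on a page.                   */
/* 128k is an assumed value, not a recommendation.                       */
data work.length4_big(bufsize=128k);
  set work.length4;
run;

/* Compare page size and observations per page with the original */
proc contents data=work.length4_big;
run;
```

Comparing the two PROC CONTENTS listings should show whether the number of pages changed on your system.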

1
votes

For datasets that contain an 8-byte numeric variable, as @jaamor's example did, the 8-byte size does have some impact on storage. It will not usually change the dataset size significantly, but for very tall and narrow datasets it may be a consideration.

When a dataset contains numeric variables that are 8 bytes long (the default), SAS places those variables at the end of the data vector and starts them at a multiple of 8 bytes, presumably to make access to them more efficient. Every variable other than an 8-byte numeric is placed at the start of the data vector, then any padding needed to bring that up to a multiple of 8 bytes is added, and then the 8-byte numeric variables are placed after that.

This can be seen by looking at the proc contents output from some example datasets.

data fourteen_eight;
  length x y $7;  *14 total;
  length i 8;
run;

data twelve_eight;
  length x y $6;  *12 total;
  length i 8;
run;

data twelve_six;
  length x y $6;  *12 total;
  length i 6;
run;

data twelve_six_eight;
  length x y $6;
  length z 6;
  length i 8;
run;

fourteen_eight has a conceptual observation length of 22, but a physical observation length of 24 (looking at PROC CONTENTS). twelve_eight has a conceptual length of 20, but a physical observation length of 24 as well. twelve_six has a conceptual length of 18 and a physical observation length of 18, meaning there is no padding if the numeric variable isn't 8 bytes long. twelve_six_eight has a conceptual length of 26 and a physical length of 32: 18 rounded up to 24, and then the 8 at the end. (You can verify it's not allocating 8 bytes for each numeric variable by simply adding several more 6-byte numerics; they never increase the total padding, and fit neatly in a smaller space.)
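If you would rather not read the numbers off four separate PROC CONTENTS listings, the same figures can be pulled with PROC SQL. This is a sketch that assumes the four example datasets were created in WORK; the column alias declared_len is made up for illustration. It reads obslen from DICTIONARY.TABLES and sums the declared lengths from DICTIONARY.COLUMNS.

```sas
proc sql;
  /* Physical observation length as recorded for each dataset */
  select memname, obslen
    from dictionary.tables
    where libname = 'WORK'
      and memname in ('FOURTEEN_EIGHT', 'TWELVE_EIGHT',
                      'TWELVE_SIX', 'TWELVE_SIX_EIGHT');

  /* Declared ("conceptual") length: sum of the variable lengths */
  select memname, sum(length) as declared_len
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('FOURTEEN_EIGHT', 'TWELVE_EIGHT',
                      'TWELVE_SIX', 'TWELVE_SIX_EIGHT')
    group by memname;
quit;
```

Any difference between obslen and declared_len for a dataset is the padding discussed here.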

Here's how it ends up looking:

  • x $6
  • y $6
  • z 6
  • i 8

would fit like so:

[00000000011111111112222222222333333333344444444445]
[12345678901234567890123456789012345678901234567890]
[xxxxxxyyyyyyzzzzzz      iiiiiiii]

One side note: I'm not 100% sure that the layout isn't [iiiiiiiixxxxxxyyyyyyzzzzzz      ] instead. That would work just as well as far as being able to predict the location of numeric variables. It doesn't really affect this, though: either way, if you have one or more 8-byte numeric variables, there will be a small amount of padding whenever your total non-8-byte-numeric storage is not a multiple of 8 bytes.
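The padding rule described in this answer can be written down as a small calculation. The following is only a model of the observed behaviour, not documented SAS internals; the variable names other_bytes, n_num8, conceptual, and physical are invented for this sketch.

```sas
/* Model: bytes from everything except 8-byte numerics are padded to a */
/* multiple of 8 only when at least one 8-byte numeric is present.     */
data _null_;
  input name :$20. other_bytes n_num8;
  conceptual = other_bytes + 8 * n_num8;
  if n_num8 > 0 then physical = ceil(other_bytes / 8) * 8 + 8 * n_num8;
  else physical = other_bytes;
  put name= conceptual= physical=;
  datalines;
fourteen_eight 14 1
twelve_eight 12 1
twelve_six 18 0
twelve_six_eight 18 1
;
run;
```

For the four example datasets this model gives physical lengths of 24, 24, 18, and 32, matching the PROC CONTENTS figures quoted above.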

0
votes

As Joe suggested, I tested this empirically using the script below:

libname testlen "<directory>";

%macro create_ds(length=, dsName=);
    data &dsName;
        length x $&length.;
        do i=1 to 1000000;
            x="";
            output;
        end;
    run;
%mend;

%macro create_all_ds;
    %do i=1 %to 20;
        %create_ds(length=&i, dsName=testlen.len&i)
    %end;
%mend;

%create_all_ds

All datasets have one character variable. The length of the variable varies across datasets, from 1 to 20.

Datasets 1-8 take up ~15.8 MB

Datasets 9-16 take up ~23.7 MB

Datasets 17-20 take up ~31.5 MB

This probably means that, for one-variable datasets, it is not space efficient to declare SAS variable lengths that are not multiples of 8.

I tried a similar test for 2 variable datasets:

%macro create_ds(length=, dsName=);
    data &dsName;
        length x y $&length.;
        do i=1 to 1000000;
            x="";
            y="";
            output;
        end;
    run;
%mend;

%macro create_all_ds;
    %do i=1 %to 20;
        %create_ds(length=&i, dsName=testlen.len&i)
    %end;
%mend;

%create_all_ds

The results are as follows:

Datasets 1-4 take up ~15.8 MB

Datasets 5-8 take up ~23.7 MB

This could mean that, for efficient length declarations, the sum of the lengths of the character variables should be a multiple of 8.
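One caveat: the datasets in both tests also keep the loop counter i, which is an 8-byte numeric. According to the answer on 8-byte numerics, that alone could account for the steps at multiples of 8. A variant that drops i would isolate the character variable. This is a sketch only; the macro name create_ds_noi and the dataset names noi7, noi8, and noi9 are made up for illustration, and it assumes the testlen libref from the script above is still assigned.

```sas
%macro create_ds_noi(length=, dsName=);
    /* drop= keeps the loop counter out of the output dataset */
    data &dsName(drop=i);
        length x $&length.;
        do i=1 to 1000000;
            x="";
            output;
        end;
    run;
%mend;

/* Lengths on either side of 8, to see whether the step remains */
%create_ds_noi(length=7, dsName=testlen.noi7)
%create_ds_noi(length=8, dsName=testlen.noi8)
%create_ds_noi(length=9, dsName=testlen.noi9)
```

If the file sizes of these three datasets grow smoothly rather than jumping after length 8, the steps seen above would be due to the numeric variable rather than to the character length itself.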