1
votes

I'm looking a way to impute using proc iml in sas the average of the sales of the next two months. As you can see sometimes I dont have the sales of 201901 and sometimes is missing on 201902 For example for the first barcode I want to find the sales[1]= mean(sales[2],sales[3]) and I want to do this for each unique barcode.

The "table A" is like this:

Obs. |  Barcode  |    date   |   sales   | Position
---------------------------------------------------------------
1    |21220000000|  201901   |     .     |   1

2   |21220000000|   201902|      311     |   2

3   |21220000000|   201903|      349     |   3

4   |21220000000|   201904|      360     |   4

5   |21220000000|   201905|      380     |   5

6   |21220000000|   201906|      440     |   6

7   |21220000000|   201907|      360     |   7

8   |21220000000|   201908|      390     |   8

9   |21220000000|   201909|      410     |   9

10  |21220000000|   201910|      520     |  10

11  |21220000000|   201911|      410     |  11

12  |21220000000|   201912|      390     |  12

13  |31350000000|   201901|      360     |   1

14  |31350000000|   201902|      .       |   2


                   .etc.
24  |31350000000|   201912|       .      |   12

25  |45480000000|   201901|      310     |   1     

26  |45480000000|   201902|        .     |   2

                   .etc.

I tried something like this but it doesnt work:

proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a); 

do i =1 to nrow(t_a);
  
      if t_a[i,4]=. and t_a[i,5]=1 then t_a[1,4]= mean(t_a[i+1,4],t_a[i+2,4]) ;
      
   i=i+1;
end;
run;

Is there a way to do it using matrices or lists in proc iml or would you recommend any other ways? Thank you in advance!

2
Is it a requirement to do this in IML?draycut
No! it is just a way I thought it. Maybe it will be easier using lag() but I wanted to avoid the extra columns.Student1212

2 Answers

2
votes

This problem only involves an ID variable (='BarCode') and a variable that has missing values (='Sales'), so you really only need to read and process two vectors.

An efficient approach is to iterate over the unique levels of the "Barcode" variable (an ID variable) and process each missing value. Thus you can reduce the problem to a "BY group analysis" in which each ID value is processed in turn. There are several ways to perform a BY-group analysis in IML. The easiest to understand and implement is the UNIQUE-LOC technique. For large data, the UNIQUEBY technique is more efficient.

The following example uses the UNIQUE-LOC technique:

proc iml;
use table_a;
   read all var {"BarCode"} into ID;
   read all var {"Sales"} into X;
close;

imputeX = X;             /* make copy of X */
u = unique(ID);          /* unique categories of the ID variable */
do i = 1 to ncol(u);     /* for each ID level */
   groupIdx = loc(ID=u[i]);
   y = x[groupIdx];      /* get the values for this level */
   k = loc( y=. );       /* which are missing? */
   if ncol(k)>0 then do; /* if some are missing, do imputation */
      n = nrow(y);
      startIdx = ((k+1) >< n);  /* starting location, don't exceed n */
      stopIdx  = ((k+2) >< n);  /* ending location, don't exceed n */
      values = y[ startIdx ] || y[ stopIdx ];
      mean = values[ , :];      /* find mean of each row */
      y[k] = mean;              /* copy mean to missing values */
      imputeX[groupIdx] = y;    /* update imputed vector (optional: write data) */
   end;
end;

print ID[F=Z11.] X imputeX;
1
votes

I don't think this is a good solution in PROC IML to your problem, but I can tell you where you're going wrong in your particular approach. Hopefully Rick or someone else can stop by to show the right IML way to solve this using Matrix operations, or you can browse the Do Loop as I'm fairly sure Rick has articles on imputation there.

That said, your issue here is that SAS IML doesn't really have very much support for tables as a data structure. They were recently added, and mostly added just to make it easier to import datasets into IML from SAS without a lot of trouble. However, you can't treat them like Pandas data frames or similar; they're really just data storage devices that you need to extract things from. You need to move data into matrices to actually use them.

Here's how I would directly translate your nonfunctional code into functional code. Again, please remember this is probably not a good way to do this - matrices have a lot of features that make them good at this sort of thing, if you use them right, and you probably don't need to use a DO loop to iterate here - you should rather use matrix multiplication to do what you want. That's really the point of using IML; if you're just iterating, then use base SAS to do this, it's much easier to write the same program in base SAS (or, even better, use the imputation procedures if you have them licensed).

data table_a;
input Obs Barcode   date  :$6.  sales  Position;
datalines;
1    21220000000    201901         .         1
2    21220000000    201902       311         2
3    21220000000    201903       349         3
4    21220000000    201904       360         4
5    21220000000    201905       380         5
6    21220000000    201906       440         6
7    21220000000    201907       360         7
8    21220000000    201908       390         8
9    21220000000    201909       410         9
10   21220000000    201910       520        10
11   21220000000    201911       410        11
12   21220000000    201912       390        12
13   31350000000    201901       360         1
14   31350000000    201902        .          2
24   31350000000    201912        .          12
25   45480000000    201901       310         1     
26   45480000000    201902         .         2
;;;;
run;

proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a); 

sales = TableGetVarData(t_a,4);
barcode = TableGetVarData(t_a,2);
do i =1 to nrow(sales);
     if missing(sales[i])  then do;    *if the sales value is missing, then ...;
      if i <= (nrow(sales) - 2) then do;    *make sure we are not going over the total;     
        if all(j(2,1,barcode[i])=barcode[i+1:i+2]) then do;  *and see if the rows are all the same barcode;
            sales[i] = mean(sales[i+1:i+2]);   *compute the mean!;
        end;
       end;
     end;
end;
      
call TableAddVar(t_a,'sales_i',sales);  *Put the matrix back in the table;

call TablePrint(t_a);  *Take a peek!;
quit;

What I do first is extract the Barcode and Sales columns into matrices. Barcode is to check to make sure we're imputing from the same ID. Then, we check to see if sales is missing for that iteration, and further make sure we're not on the last two iterations (or it'll give an out-of-range error). Lastly, we compare the barcode with the next two barcodes and make sure they're the same. (The way I do that is pretty silly, honestly, but it's the quickest way I can think of.) If those all pass, then we calculate the mean.

Finally, we add the matrix back on the t_a table, and you can export it at your leisure back to a SAS dataset, or do whatever it is you want to do with it!

Again - this isn't really a good way to do this, it's more a direct answer of "what is wrong with your code". Find a better solution to imputation than this!