3
votes

Here is a sample code that was derived from actual application. There are two datasets - "aa" for a query and "bb" for subquery. Column "m" from datasets "aa" matches column "y" from datasets "bb". Also, there is "yy" column on "aa" table has a value of 30. Column "m" from datasets "aa" contains value "30" in one of its rows, and column "y" from datasets "bb" does not. First proc sql uses values from "y" column of "bb" table to subset table "aa" based on matching values in column "m". It is a correct query and produces results as expected. Second proc sql block has column "y" intentionally misspelled as "yy" in subquery in a row that stars with where statement. Otherwise the whole proc sql block is the same as the first one. Given that there is no column "yy" on dataset bb, I would expect an error message to appear and the whole query to fail. However, it does return one row without failing or error messages. Closer look would suggest that it actually uses "yy" column from table "aa" (see tree in the log output). I do not think this is a correct behavior. If you would have some comments or explanations, I would greatly appreciate it. Otherwise, I maybe should report it to SAS as a bug. Thank you!

Here is the code:

options 
msglevel = I
; 
data aa;
    do i=1 to 20;
    m=i*5;
    yy=30;
    output;
    end;
run;

data bb;
    do i=10 to 20;
    y=i*5;
    output;
    end;
run;
option DEBUG=JUNK ;

/*Correct sql command*/
proc sql _method
_tree
;
    create table cc as
        select *
        from aa
        where m in (select y from bb)
;quit;


/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
    create table dd as
        select *
        from aa
        where m in (select yy from bb)
;quit;

Here is log with sql tree:

119      options
120      msglevel = I
121      ;
122      data aa;
123          do i=1 to 20;
124          m=i*5;
125          yy=30;
126          output;
127          end;
128      run;

NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


129
130      data bb;
131          do i=10 to 20;
132          y=i*5;
133          output;
134          end;
135      run;

NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


136      option DEBUG=JUNK ;
137
138      /*Correct sql command*/
139      proc sql _method
140      _tree
141      ;
142          create table cc as
143              select *
144              from aa
145              where m in (select y from bb)
146      ;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxfil
              sqxsrc( WORK.AA )

NOTE: SQL subquery execution methods chosen are:

          sqxsubq
              sqxsrc( WORK.BB )

Tree as planned.
                               /-SYM-V-(aa.i:1 flag=0001)
                     /-OBJ----|
                    |         |--SYM-V-(aa.m:2 flag=0001)
                    |          \-SYM-V-(aa.yy:3 flag=0001)
           /-FIL----|
          |         |                    /-SYM-V-(aa.i:1 flag=0001)
          |         |          /-OBJ----|
          |         |         |         |--SYM-V-(aa.m:2 flag=0001)
          |         |         |          \-SYM-V-(aa.yy:3 flag=0001)
          |         |--SRC----|
          |         |          \-TABL[WORK].aa opt=''
          |         |          /-SYM-V-(aa.m:2)
          |          \-IN-----|
          |                   |                              /-SYM-V-(bb.y:2 flag=0001)
          |                   |                    /-OBJ----|
          |                   |          /-SRC----|
          |                   |         |          \-TABL[WORK].bb opt=''
          |                    \-SUBC---|
 --SSEL---|


NOTE: Table WORK.CC created, with 11 rows and 3 columns.

146!      quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds


147
148
149      /*Incorrect sql command - column "yy" in not on "bb" table"*/
150      proc sql _method
151      _tree;
152          create table dd as
153              select *
154              from aa
155              where m in (select yy from bb)
156      ;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxfil
              sqxsrc( WORK.AA )

NOTE: SQL subquery execution methods chosen are:

          sqxsubq
              sqxreps
                  sqxsrc( WORK.BB )

Tree as planned.
                               /-SYM-V-(aa.i:1 flag=0001)
                     /-OBJ----|
                    |         |--SYM-V-(aa.m:2 flag=0001)
                    |          \-SYM-V-(aa.yy:3 flag=0001)
           /-FIL----|
          |         |                    /-SYM-V-(aa.i:1 flag=0001)
          |         |          /-OBJ----|
          |         |         |         |--SYM-V-(aa.m:2 flag=0001)
          |         |         |          \-SYM-V-(aa.yy:3 flag=0001)
          |         |--SRC----|
          |         |          \-TABL[WORK].aa opt=''
          |         |          /-SYM-V-(aa.m:2)
          |          \-IN-----|
          |                   |                              /-SYM-A-(#TEMA001:1 flag=0035)
          |                   |                    /-OBJ----|
          |                   |          /-REPS---|
          |                   |         |         |--empty-
          |                   |         |         |--empty-
          |                   |         |         |          /-OBJ----|
          |                   |         |         |--SRC----|
          |                   |         |         |          \-TABL[WORK].bb opt=''
          |                   |         |         |--empty-
          |                   |         |         |--empty-
          |                   |         |         |                    /-SYM-A-(#TEMA001:1 flag=
0035)
          |                   |         |         |          /-ASGN---|
          |                   |         |         |         |          \-SUBP(1)
          |                   |         |          \-OBJE---|
          |                    \-SUBC---|
          |                              \-SYM-V-(aa.yy:3)
 --SSEL---|


NOTE: Table WORK.DD created, with 1 rows and 3 columns.

156!      quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Here are datasets:

aa:
i   m   yy
1   5   30
2   10  30
3   15  30
4   20  30
5   25  30
6   30  30
7   35  30
8   40  30
9   45  30
10  50  30
11  55  30
12  60  30
13  65  30
14  70  30
15  75  30
16  80  30
17  85  30
18  90  30
19  95  30
20  100 30

bb:
i   y
10  50
11  55
12  60
13  65
14  70
15  75
16  80
17  85
18  90
19  95
20  100
2

2 Answers

1
votes

I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.

In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:

NOTE: SQL subquery execution methods chosen are:

          sqxsubq
              sqxreps   <--- What is this doing?
                  sqxsrc( WORK.BB )

Similarly, the corresponding section from the _tree output is highly obscure:

          |                   |                              /-SYM-A-(#TEMA001:1 flag=0035)
          |                   |                    /-OBJ----|
          |                   |          /-REPS---|
          |                   |         |         |--empty-
          |                   |         |         |--empty-
          |                   |         |         |          /-OBJ----|
          |                   |         |         |--SRC----|
          |                   |         |         |          \-TABL[WORK].bb opt=''
          |                   |         |         |--empty-
          |                   |         |         |--empty-
          |                   |         |         |                    /-SYM-A-(#TEMA001:1 flag=    0035)
          |                   |         |         |          /-ASGN---|
          |                   |         |         |         |          \-SUBP(1)
          |                   |         |          \-OBJE---|
          |                    \-SUBC---|
          |                              \-SYM-V-(aa.yy:3)

I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):

http://support.sas.com/resources/papers/proceedings10/139-2010.pdf http://www2.sas.com/proceedings/sugi30/101-30.pdf

Quoting the first of these:

Codes    Description
sqxcrta  Create table as Select
Sqxslct  Select
sqxjsl   Step loop join (Cartesian)
sqxjm    Merge join
sqxjndx  Index join
sqxjhsh  Hash join
sqxsort  Sort
sqxsrc   Source rows from table
sqxfil   Filter rows
sqxsumg  Summary stats with GROUP BY
sqxsumn  Summary stats with no GROUP BY

Based on a bit of quick testing, this seems to happen regardless of the variable and tables names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name is initially the same as the name of the corresponding variable in AA but then continues for one or more characters.

From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.

An example of the more general case:

options 
msglevel = I
; 
data aa;
    do i=1 to 20;
    m=i*5;
    myvar_plus_suffix=30;
    output;
    end;
run;

data bb;
    do i=10 to 20;
    myvar=i*5;
    output;
    end;
run;
option DEBUG=JUNK ;

/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
    create table dd as
        select *
        from aa
        where m in (select myvar_plus_suffix from bb)
;quit;
1
votes

Here is a response from SAS support. What you are seeing is related to column scoping in PROC SQL.

PROC SQL supports Corellated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.

If a fully qualified column name is used, the error you are expecting will occur such as the following:

proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;