0
votes

I am attempting to narrow down the SAS dataset produced by a pass-through query to Hadoop (i.e., using PROC SQL CONNECT TO, not a LIBNAME) by sending additional conditions in the inner (Hadoop-side) WHERE clause.

For example, the code that works looks like this (the &dtpart macro variable is defined earlier to down-select to a single partition in HDFS):

proc sql;
   connect to hadoop (server="&srvid" port=10000 user="&uid" pw="&passwd" schema=default );
   create table work.creative_lkup as
   select 
      advertiser_id,
      creative_id,
      creative,
      rendering_id,
      creative_type, 
      input("&dtpart.", yymmdd10.) as last_updt_dt format=mmddyy10.
   from connection to hadoop (
      select
       `advertiser id` as advertiser_id,
       `creative id` as creative_id,
       creative,
       `rendering id` as rendering_id,
       `creative type` as creative_type
      from default.match_table_creatives 
      where date_partition = "&dtpart." 
   ) 
   WHERE advertiser_id = '12345';
disconnect from hadoop;
quit;

NOTE: As you can see, the field names in Hadoop are not SAS-standard (hence the backticks around names containing spaces), and for some tables the field and table names get quite lengthy. This makes it impractical to simply use a LIBNAME statement for my Hadoop connection, so I must use a pass-through query in PROC SQL with CONNECT TO.
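(For reference, a LIBNAME version would look roughly like the sketch below. This is only an illustration of why I abandoned that route; the libref hdp and the validvarname setting are my own placeholders, not production code. Every column whose name contains a space has to be referenced as a name literal, which gets unwieldy fast on wide tables with long names.)

options validvarname=any;   /* keep non-standard Hadoop column names as SAS name literals */
libname hdp hadoop server="&srvid" port=10000 user="&uid" password="&passwd" schema=default;

proc sql;
   create table work.creative_lkup_libname as
   select
      'advertiser id'n as advertiser_id,
      'creative id'n   as creative_id,
      creative
   from hdp.match_table_creatives
   where date_partition = "&dtpart.";
quit;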

What I'd like to do is move the "WHERE advertiser_id = '12345'" portion inside the pass-through to Hadoop. Something like this:

proc sql;
   connect to hadoop (server="&srvid" port=10000 user="&uid" pw="&passwd" schema=default );
   create table work.creative_lkup as
   select 
      advertiser_id,
      creative_id,
      creative,
      rendering_id,
      creative_type, 
      input("&dtpart.", yymmdd10.) as last_updt_dt format=mmddyy10.
   from connection to hadoop (
      select
       `advertiser id` as advertiser_id,
       `creative id` as creative_id,
       creative,
       `rendering id` as rendering_id,
       `creative type` as creative_type
      from default.match_table_creatives 
      where date_partition = "&dtpart." 
         and `advertiser id` = '12345'
   );
disconnect from hadoop;
quit;

Unfortunately, when I do this, I get a nice vague error from Hadoop passed back to SAS:

ERROR: Prepare error: Error while processing statement: FAILED: Execution Error, return code 1 from 
   org.apache.hadoop.hive.ql.exec.mr.MapRedTask

My question is: are there any settings, options, or other items that I could include in my PROC SQL statement in order to allow proper WHERE clause utilization?

It appears that the problem only occurs when there is an AND in the WHERE clause. I've read a few things about SPDE (the ACCELWHERE= option sounds promising), but I have no idea how to use such options in PROC SQL, and the reference materials I've found online seem to point only to their use in a LIBNAME statement.

Thank you.

If you set `option validvarname=v7` SAS should do some of that automatic name conversion for you, but the code in the connection statement needs to be valid Hadoop code, not SAS code. From what I see it looks the same, except you should likely be using the Hadoop name. Is it a number in Hadoop, i.e., does it need the quotes? - Reeza
I will try the option you mentioned and let you know if that worked. As for the ID numbers, the way the table is defined, every field (including the "date partition" field) is a string. (Just the way the team who built these tables decided to define them.) - PlanetJuggler
Just guessing, but I'd be a bit suspicious of having two string literals, one with double quotes and one with single quotes. Have you tried using double quotes on both? Many languages aren't as flexible as SAS in this regard. - Chris Long
Double or single quotes make no difference in Hadoop. I've submitted the inner query directly into Hive, with the WHERE...AND, and it runs fine. It's just when I submit it through SAS that it has a problem. - PlanetJuggler
What if you run it without macro variables, but hardcode the values, does it work then? - Reeza

1 Answer

1
votes

It appears that the problem is that the user specified in the connect string (held in the SAS macro variable &uid) does not actually exist in Hadoop.

The macro variable &uid holds the name of a generic user ID (different names for the DEV and PROD environments - hence the need for a macro variable). It was assumed that this user existed in Hadoop, since I was able to use it to connect to Hadoop from SAS and retrieve records via simple queries.

Apparently, this is not the case.

Without a user existing in the Hadoop environment, there is no home folder for that user in HDFS. Our Hadoop setup will still process "simple" queries (i.e., no transforms on fields; the only WHERE condition is against the partition field; no joins; etc.), because no MapReduce task needs to be built to pull that data straight from the underlying flat files in HDFS. It is only when there are additional WHERE conditions (i.e., on something other than the partition field) that a MapReduce task must be created. That task - and any temp files it needs - gets created in the home folder of the user running the query. Since the home folder doesn't exist for this user, the MapReduce job throws an error (without even generating a log file in Hive, because there is nowhere to write the log file, or even to create the task that would produce it).

It's goofy that it would even allow data to be selected without a valid userID, but that is our current setup.

As such, to test this theory, I ran the following code (NOTE: I used my username instead of the macro variable).

proc sql;
   connect to hadoop (server="&srvid" port=10000 user="myuserid" pw="unnecessary" schema=default);
   create table work.creative_lkup2 as
   select 
      advertiser_id,
      creative_id,
      creative,
      rendering_id,
      creative_type, 
      input("&dtpart.", yymmdd10.) as last_updt_dt format=mmddyy10.
   from connection to hadoop (
      select
         c.`advertiser id` as advertiser_id,
         c.`creative id` as creative_id,
         c.creative,
         c.`rendering id` as rendering_id,
         c.`creative type` as creative_type
      from default.match_table_creatives c
      where c.date_partition = "&dtpart." and c.`advertiser id` = "12345"
   );
   disconnect from hadoop;
quit;

This returned approximately 80k records (10 seconds of CPU time, but almost 10 minutes of elapsed time due to I/O from AWS).

As a thank you to all who viewed this issue and responded, I'd like to offer an additional lesson learned from this that may be helpful to you:

Since the fields in Hadoop are all defined as STRING, SAS has to assume each field is the longest possible character length: 32,767. As a result, the volume of data transferred (mostly trailing blanks) was excessive, contributing to the slow transfer rate.
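(As an aside - and this is an untested suggestion on my part, not something we tried - SAS/ACCESS to Hadoop documents a DBMAX_TEXT= option that caps the length SAS assigns to Hive STRING columns. I've only seen it documented for the LIBNAME engine, so check your version before relying on it; the libref hdp below is arbitrary.)

/* Hypothetical: cap STRING columns at 200 bytes instead of 32,767 */
libname hdp hadoop server="&srvid" port=10000 user="myuserid" password="doesnt_matter"
        schema=default dbmax_text=200;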

The work-around we came up with involved using EXECUTE (...) BY HADOOP statements to create and populate a temp table with correctly sized VARCHAR fields (already down-selected to the data we want) before ingesting it into SAS.

proc sql;
   connect to hadoop (server="&srvid" port=10000 user="myuserid" pw="doesnt_matter" schema=default);
   execute (
      create table if not exists default.test_lkup_cre (
         advertiser_id Varchar(10),
         creative_id Varchar(10),
         creative Varchar(200),
         rendering_id Varchar(10),
         creative_type Varchar(16)
      )
   ) by hadoop;
   execute (
      insert into table default.test_lkup_cre
      select
         c.`advertiser id` as advertiser_id,
         c.`creative id` as creative_id,
         c.creative,
         c.`rendering id` as rendering_id,
         c.`creative type` as creative_type
      from default.match_table_creatives c
      where c.date_partition = "&dtpart." and c.`advertiser id` = "12345"
   ) by hadoop;
   create table work.creative_lkup3 as
   select 
      advertiser_id,
      creative_id,
      creative,
      rendering_id,
      creative_type, 
      input("&dtpart.", yymmdd10.) as last_updt_dt format=mmddyy10.
   from connection to hadoop (
      select
         advertiser_id,
         creative_id,
         creative,
         rendering_id,
         creative_type
      from default.test_lkup_cre
   );
   disconnect from hadoop;
quit;

This returned the same ~80k records in just 46 seconds.
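One hedged housekeeping addition (not included in the timing above): since the staging table persists in Hive, you can drop it once the SAS dataset has been built:

proc sql;
   connect to hadoop (server="&srvid" port=10000 user="myuserid" pw="doesnt_matter" schema=default);
   execute (drop table if exists default.test_lkup_cre) by hadoop;
   disconnect from hadoop;
quit;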

Hopefully, this helps someone else out there as well.