simple random sampling while pulling data from warehouse(oracle engine) using proc sql in sas

Question

I need to pull humongous amount of data, say 600-700 variables from different tables in a data warehouse...now the dataset in its raw form will easily touch 150 gigs - 79 MM rows and for my analysis purpose I need only a million rows...how can I pull data using proc sql directly from warehouse by doing simple random sampling on the rows.

Below code wont work as ranuni is not supported by oracle

    proc sql outobs =1000000;
    select * from connection to oracle(
    select * from tbl1 order by ranuni(12345);
    quit;

How do you propose I do it

Oracle provides some solid tools to perform sampling. Have you looked at the sample sql clause (blogs.oracle.com/datamining/to-sample-or-not-to-sample-part-2). — Patrick Bacon
btw, your question is very similar to this question, stackoverflow.com/questions/9868409/… — Patrick Bacon
Do you need exactly one million rows or just approximately one million rows? — APC
@APC one million is not fixed but whatever number I specify should be exact — Rohan
Possible duplicate of How to get records randomly from the oracle database? — david25272

Patrick Bacon Patrick Bacon · Accepted Answer · 2017-08-05T07:23:58

Use the DBMS_RANDOM Package to Sort Records and Then Use A Row Limiting Clause to Restrict to the Desired Sample Size

The dbms_random.value function obtains a random number between 0 and 1 for all rows in the table and we sort in ascending order of the random value.

Here is how to produce the sample set you identified:

    SELECT
    *
FROM
    (
        SELECT
            *
        FROM
            tbl1
        ORDER BY dbms_random.value
    )
FETCH FIRST 1000000 ROWS ONLY;

To demonstrate with the sample schema table, emp, we sample 4 records:

   SCOTT@DEV> SELECT
  2      empno,
  3      rnd_val
  4  FROM
  5      (
  6          SELECT
  7              empno,
  8              dbms_random.value rnd_val
  9          FROM
 10              emp
 11          ORDER BY rnd_val
 12      )
 13  FETCH FIRST 4 ROWS ONLY;
EMPNO  RND_VAL
7698   0.06857749035643605682648168347885993709
7934   0.07529612360785920635181751566833986766
7902   0.13618520865865754766175030040204331697
7654   0.14056380246495282237607922497308953768


SCOTT@DEV> SELECT
  2      empno,
  3      rnd_val
  4  FROM
  5      (
  6          SELECT
  7              empno,
  8              dbms_random.value rnd_val
  9          FROM
 10              emp
 11          ORDER BY rnd_val
 12      )
 13  FETCH FIRST 4 ROWS ONLY;
EMPNO  RND_VAL
7839   0.00430658806761508024693197916281775492
7499   0.02188116061148367312927392115186317884
7782   0.10606515700372416131060633064729870016
7788   0.27865276349549877512032787966777990909

With the example above, notice that the empno changes significantly during the execution of the SQL*Plus command.

The performance might be an issue with the row counts you are describing.

EDIT:

With table sizes in the order of 150 gigs - 79 MM, any sorting would be painful.

If the table had a surrogate key based on a sequence incremented by 1, we could take the approach of selecting every nth record based on the key.

e.g.

    --scenario n = 3000

 FROM
    tbl1
WHERE
    mod(table_id, 3000) = 0;

This approach would not use an index (unless a function based index is created), but at least we are not performing a sort on a data set of this size.

I performed an explain plan with a table that has close to 80 million records and it does perform a full table scan (the condition forces this without a function based index) but this looks tenable.

simple random sampling while pulling data from warehouse(oracle engine) using proc sql in sas

2 Answers