1
votes

I have 2 datasets, one large and one small. While functionality is most important here, efficiency matters as well, because the large dataset could be potentially tens of millions of records. Let's call my large dataset "Transactions" and my small one "Prices." Here's what I'm trying to do. In the transactions file, there are a bunch of values for 'Store'. For each 'Store', I would like to create a hash table of relevant 'Product', and for each unique 'Saledate', pull all product 'Price' into the output table, not just for the associated 'Product' but for ALL 'Products' found in the Transactions dataset for that 'Store'.

Here's a sample of the "Transactions" dataset:

Store   Product SaleDate  Price
A       apple   1/1/2011  1.05
A       apple   1/3/2011  1.02
A       apple   1/4/2011  1.07

A       pepper  1/2/2011  0.73
A       pepper  1/3/2011  0.75
A       pepper  1/6/2011  0.79

And here's a sample of the "Prices" dataset:

        Product Saledate    Price
        apple   1/1/2011    1.05  
        apple   1/2/2011    1.06
        apple   1/3/2011    1.02
        apple   1/4/2011    1.07
        ...

        pepper  1/1/2011    0.74
        pepper  1/2/2011    0.73
        pepper  1/3/2011    0.75
        pepper  1/4/2011    0.75
        pepper  1/5/2011    0.75    
        pepper  1/6/2011    0.79
        ...

        mango   1/1/2011    2.40
        mango   1/2/2011    2.42
        ...

And here's the desired output for this scenario(the asterisk denotes that the price was inserted from "Prices" into "Transactions" via lookup).

 Store  Product Saledate    Price
 A      apple   1/1/2011    1.05    
 A      pepper  1/1/2011    0.74 *
 A      apple   1/2/2011    1.06 *
 A      pepper  1/2/2011    0.73
 A      apple   1/3/2011    1.02
 A      pepper  1/3/2011    0.75
 A      apple   1/4/2011    1.07
 A      pepper  1/4/2011    0.75 *
 A      apple   1/6/2011    1.10 *
 A      pepper  1/6/2011    0.79

Basically, in pseudo-code, the idea is do loop through each store, look-up its list of distinct products and saledates, then find corresponding prices for those products and saledates in the prices data and insert them if they are not already there. In the example, since apple has a saledate of 1/1 in "Transactions", I need the saledate for pepper as well (from "Prices"). The same applies for 1/2, except vice-versa since the price is already there for pepper, but not for apple. 1/5 has a record in "Prices", but it's not needed, since it doesn't occur for either apple/pepper in "Transactions." For another store, different products exist, so pepper is not relevant at all, but mango is.

I've tried several approaches and can't get around the "double" look-up hangup that I think is necessary.

Here is a link to a previous question/answer related to this new question. The answer provides sample code to create the dummy tables. https://stackoverflow.com/a/36961795/214994

1

1 Answers

1
votes

I think you can accomplish what you are looking for with a series of SQL statements.

data trans;
format Store $1. Product $6. SaleDate mmddyy10. price best.;
informat SaleDate mmddyy10.;
input Store $ Product $ SaleDate price;
datalines;
A       apple   1/1/2011  1.05
A       apple   1/3/2011  1.02
A       apple   1/4/2011  1.07
A       pepper  1/2/2011  0.73
A       pepper  1/3/2011  0.75
A       pepper  1/6/2011  0.79
B       apple   1/1/2011  1.05
B       apple   1/3/2011  1.02
B       apple   1/4/2011  1.07
B       mango  1/2/2011  2.42
B       mango  1/3/2011  2.43
B       mango  1/6/2011  2.46
;

data prices;
format Product $6. SaleDate mmddyy10. price best.;
informat SaleDate mmddyy10.;
input Product $ SaleDate price;
datalines;
apple   1/1/2011    1.05
apple   1/2/2011    1.06
apple   1/3/2011    1.02
apple   1/4/2011    1.07
apple   1/5/2011    1.10
apple   1/6/2011    1.15
pepper  1/1/2011    0.74
pepper  1/2/2011    0.73
pepper  1/3/2011    0.75
pepper  1/4/2011    0.75
pepper  1/5/2011    0.75
pepper  1/6/2011    0.79
mango   1/1/2011    2.40
mango   1/2/2011    2.42
mango   1/3/2011    2.43
mango   1/4/2011    2.44
mango   1/5/2011    2.45
mango   1/6/2011    2.46
;

proc sql noprint;
/*Store and Date Combinations*/
create table store_dates as
select distinct store, saledate
    from trans;

/*Store and Product Combinations*/
create table store_products as
select distinct store, product
    from trans;

/*Full Joins the combination tables and look up the price with a left join*/
create table want as
select a.store,
       b.product,
       a.saledate,
       c.price
    from store_dates as a
      full join
         store_products as b
        on a.store=b.store
      left join
         prices as c
      on b.product = c.product
     and a.saledate = c.saledate
    order by a.store, a.saledate, b.product;
quit;

This assumes that there are no differences in prices between TRANS and PRICES. If so, add another left join of TRANS and a coalesce() to handle.