4
votes

I have the following dataset:

data work.dataset;
input a b c;
datalines;
27 93 71 
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

I want to select all distinct combinations of the first two columns, so I wrote:

proc sql;
create table work.output1 as
select distinct t1.a,
t1.b
from work.dataset t1;
quit;

Now I want to know which value of c in the original dataset sits next to each (a, b) combination in the output. Is there a way to find out? I tried the following PROC SORT, but I don't know whether it behaves the same way as selecting distinct records in PROC SQL.

proc sort data = work.dataset out = work.output2 NODUPKEY;
by a b;
run;

Thanks in advance for your help.


3 Answers

3
votes

PROC SORT with NODUPKEY always keeps the first physical record for each key; with the data as you list it, c=71 is always the value kept for a=27, b=93. PROC SQL will not necessarily return any particular record. You could ask for the min or max, but you cannot guarantee getting the first record in the original order however you write the query, because SQL will often re-sort the data to run the query as efficiently as possible.

The two approaches are identical insofar as they return the same number of records, if that is your concern.
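One way to check that claim is to compare the key columns of the two outputs. This is a minimal sketch, assuming work.output1 (from PROC SQL) and work.output2 (from PROC SORT) from the question already exist; work.output1_sorted is a hypothetical name introduced only for this example.

```sas
/* Assumption: work.output1 and work.output2 were created by the
   code in the question. work.output1_sorted is a made-up name. */

/* SELECT DISTINCT does not promise an output order, so sort first. */
proc sort data=work.output1 out=work.output1_sorted;
  by a b;
run;

/* Compare only the key columns; c exists only in work.output2. */
proc compare base=work.output1_sorted
             compare=work.output2(keep=a b);
run;
```

If the two datasets hold the same (a, b) combinations, the PROC COMPARE report should show no unequal values and the same number of observations in both.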

You cannot do exactly the same thing in a straightforward way in SQL. Because SQL has no concept of row order, you would need either a rule for choosing which c to keep (max(c), min(c), etc.) or a row counter from which you pick the lowest value.

For example:

data work.dataset;
input a b c;
rowcounter=_n_;
datalines;
27 93 71 
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

proc sql;
select a,b,min(rowcounter*100+c)-min(rowcounter*100) as c
from work.dataset
group by a,b;
quit;

That relies on a cheat: rowcounter*100 always dominates c, which holds here only because every c is between 0 and 99. If your c values don't fit that range, this won't work, and you're better off merging c back on separately.
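A rough sketch of the "merge it on separately" route, assuming the dataset above with its rowcounter column: find the lowest rowcounter per (a, b), then join back to pick up the c on that row. The names work.first_c and first_row are hypothetical and used only for illustration.

```sas
/* Assumption: work.dataset has the rowcounter column created above.
   work.first_c and first_row are made-up names for this sketch. */
proc sql;
  create table work.first_c as
  select t.a, t.b, t.c
  from work.dataset as t
       inner join
       /* lowest row number within each (a, b) combination */
       (select a, b, min(rowcounter) as first_row
        from work.dataset
        group by a, b) as f
       on  t.a = f.a
       and t.b = f.b
       and t.rowcounter = f.first_row
  order by t.a, t.b;
quit;
```

This version makes no assumption about the range of c, at the cost of reading the dataset twice.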

If you are interested in a pure SQL solution, consider posting that as a separate question, since the SQL-only folk will then answer it.

1
votes

NODUPKEY will return one observation for each key. In your example only one of the two observations with a=27 and b=93 will be kept. Either c=71 or c=72 will be lost.

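The NODUPRECS option (alias NODUP) removes duplicate records instead. Both observations with a=27 and b=93 will be kept, but only one of the two with the values a=34, b=34 and c=32. Note that it only compares each observation with the previous one written, so identical records must end up adjacent after the sort to be removed.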
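To see the difference on the question's data, the two options can be run side by side. This is only a sketch: the output dataset names (work.nodupkey_out, work.dropped_keys, work.noduprecs_out) are invented for the example. DUPOUT= is a PROC SORT option that writes the discarded observations to a separate dataset, which also shows which values of c were dropped.

```sas
/* Assumption: work.dataset is the nine-row dataset from the question.
   All output dataset names below are hypothetical. */

/* One observation per (a, b); discarded rows go to work.dropped_keys. */
proc sort data=work.dataset
          out=work.nodupkey_out
          dupout=work.dropped_keys
          nodupkey;
  by a b;
run;

/* Only fully identical, adjacent observations are removed. */
proc sort data=work.dataset
          out=work.noduprecs_out
          noduprecs;
  by a b;
run;

/* Inspect which c values NODUPKEY threw away. */
proc print data=work.dropped_keys;
run;
```

With the sample data there are six distinct (a, b) combinations, so the NODUPKEY output should hold six observations and three should land in the DUPOUT= dataset.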

0
votes

PROC SQL will not return a value for variable c in the query above, because c is not listed in the SELECT clause. I think what you may be looking for is:

proc sql;
create table work.output1 as
select t1.a,
t1.b,
min(t1.c) as c
from work.dataset t1
group by a, b;
quit;

If you would like the maximum value of c, replace the function with max(t1.c) as c, or use any other SQL summary function to select your value. If you want to replicate PROC SORT NODUPKEY and take the first value listed, you would need the monotonic() function (I know... unsupported by SAS, but it's there, so whatever). Your code would now be:

proc sql;
create table work.output1 as
select monotonic() as rownum,
t1.a,
t1.b,
t1.c
from work.dataset t1
group by a, b
having calculated rownum = min(calculated rownum);
quit;