5
votes

I am in the midst of re-writing a SAS program I normally use into R. Many statements in the SAS program are written with proc sql, so for ease of programming I am replicating those statements with sqldf in R. All of the sql code works the same with sqldf as it does in SAS, except for this statement:

SAS:

proc sql;
create table merged32 as
select max(ctf) as ctf,
       max(ctms) as ctms,
       max(site_name) as site_name,
       provider_id, npi,
       first_name, middle_name, last_name, specialty,
       address_1, city, state, site_ct, zip, site_key
    from merged2
    group by 9,10,11,12,14,15;
quit;
run;

sqldf:

sqldf("select max(ctf) as ctf,
              max(ctms) as ctms,
              max(site_name) as site_name,
              provider_id, npi,
              first_name, middle_name, last_name, specialty,
              address_1, city, state, site_ct, zip, site_key
       from merged2
       group by 9,10,11,12,14,15")

In SAS, it returns a dataset with 1878 rows; in sqldf, a dataframe with 1375.

Are there any major differences between proc sql and sqldf that could be causing this, or in general that need to be considered? I didn't provide data because 1) the datasets are huge, and 2) I'm more just interested in knowing the differences between the two sql systems.

2
Not sure about sqldf, but for meaningful results, you should group by all your key variables in SQL.stevepastelan
I never knew why I tend to put the "key" variables first, and the aggregates last. now I know...wildplasser
shoot, didn't even notice the missing key variables in the list. Thanks guysuser1445246
@user1445246, has this question been answered through the inputs in the comments? If so, please feel free to add your solution as an answer and mark it as accepted to help remove it from the "unanswered question" queue.A5C1D2H2I1M1N2O1R2T1

2 Answers

2
votes

Not sure about sqldf, but for meaningful results, you should group by all your key variables in SQL.

0
votes

The results in SAS maybe have duplicated records, while those in R do not