3 votes

SAS has a way of creating a library (using LIBNAME). This is helpful in long data-processing jobs because we don't have to keep changing dataset names: if we want to reuse a dataset without renaming it, we can put it in a library. Even if two datasets have the same name, we can work with them together because they live in different libraries.

My question: is there any such option in R that can create a library (or a separate folder within R) where we can save our data?

Here's the example:

Suppose I have a dataset dat1. I summarize variables var1 and var2 in dat1 by var3.

proc summary data=dat1 nway missing;
  var var1 var2;
  class var3;
  output out=tmp.dat1 (drop = _freq_ _type_) sum = ;
  run;

Then I merged dat1 with dat2, another dataset. Both dat1 and dat2 have the common variable var3, which I merged on. I created a new dataset, again called dat1.

proc sql;
   create table dat1 as
   select a.*,b.*
   from dat1 a left join tmp.dat2 b
   on a.var3=b.var3;
  quit;

Now I'm summarizing dataset dat1 again after the merge, to check whether the values of var1 and var2 remain the same before and after merging.

proc summary data=dat1 nway missing;
  var var1 var2;
  class var3;
  output out=tmp1.dat1 (drop = _freq_ _type_) sum = ;
  run;

The equivalent code in R would be:

dat3 <- ddply(dat1, .(var3), summarise, var1 = sum(var1, na.rm = TRUE), var2 = sum(var2, na.rm = TRUE))

dat1 <- sqldf("select a.*, b.* from dat1 a left join dat2 b on a.var3 = b.var3")

dat4 <- ddply(dat1, .(var3), summarise, var1 = sum(var1, na.rm = TRUE), var2 = sum(var2, na.rm = TRUE))

In SAS I used just two dataset names, but in R I'm using four. So if I'm writing 4000 lines of data-processing code, having so many dataset names sometimes becomes overwhelming. In SAS it was easy to keep the same dataset name because I'm using two libraries, tmp and tmp1, in addition to the default work library.

In SAS, library is defined as:

LIBNAME tmp "directory_path\folder_name";

In this folder, dat1 will be stored.

This question may make sense to a SAS user, but it makes no sense to the rest of us. Why don't you explain what you want out of R and how the current way you do things is lacking? Perhaps with a reproducible example? – Ari B. Friedman
To save datasets please see ?save (?load to load them). – sgibb
Your problem is you are writing 4000-line scripts. This may not be a problem in SAS, where anything over five lines is confusing already, but in R you should never write anything more than about ten lines without thinking "hey, this should be wrapped up in a function." – Spacedman
Thanks Spacedman for your comment. My problem is not with writing a 4K-line script. If you look at the example, in SAS I used around 20 lines, but in R I did the same thing in just 3 lines, so R is more efficient for writing code. But in R I have to define more data names, which I don't have to do in SAS just because it has the library option. I just want something equivalent to a library. – Beta
It sounds like you might want to work with different, named environments. – Glen_b

4 Answers

6 votes

From what I can gather from the SAS online help, a SAS library is a set of datasets stored in a folder that can be referenced as a unit. The equivalent in R would be to store the R objects you want to keep using save:

save(obj1, obj2, etc, file = "stored_objects.rda")

Loading the objects can be done using load.
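As a minimal sketch (the object names, values, and file path here are invented for illustration):

```r
# two hypothetical data frames to keep together
dat1 <- data.frame(x = 1:3)
dat2 <- data.frame(y = 4:6)

# save both into one file (roughly analogous to datasets sitting in one SAS library)
path <- file.path(tempdir(), "stored_objects.rda")
save(dat1, dat2, file = path)

# drop them from memory, then restore them under their original names
rm(dat1, dat2)
load(path)
```

You could keep one such .rda file per "library" (e.g. tmp.rda, tmp1.rda) and load whichever one you need before a processing step.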

Edit: I don't really see why having an additional object or two is so much of a problem. However, if you want to reduce the number of objects, just put your results in a list.
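For example, related results can live together in one named list instead of as separate top-level objects (a sketch; the names and values are invented):

```r
# one list holding both versions of the summary, instead of dat3 and dat4
results <- list()
results$before <- data.frame(var3 = c("a", "b"), var1 = c(3, 7))
results$after  <- data.frame(var3 = c("a", "b"), var1 = c(3, 7))

# comparing the two is then a single expression
identical(results$before, results$after)  # TRUE
```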

5 votes

There are two separate aspects of SAS's libraries which (it seems) you are interested in.

  • Specification of the directory in which data files are stored
  • Ability to easily point an analysis to a different set of identically named datasets by just specifying the different location

Taking these in that order.

The problem with answering the first is that R and SAS have different models for how data is stored. R stores data in memory, organized in environments arranged in a particular search order. SAS stores data on disk and the names of datasets correspond to file names within a specified directory (there likely is caching in memory for optimization, but conceptually this is how data is stored). R can store (sets of) objects in a file on disk using save() and bring them back into memory using load(). The filename and directory can be specified in those function calls (hence Paul's answer). You could have several .RData files, each containing objects named dat1, dat2, etc. which can be loaded prior to running an analysis and the results can be written out to (other) .RData files.

An alternative to this would be using one of the extensions which give data types which are backed by disk storage instead of memory. I've not had experience with any of them to talk about how well they would work in this situation, but that is an option. [Edit: mnel's answer has a detailed example of just this idea.]

Your second part can be approached in different ways. Since R uses in-memory data, the answers would focus on arranging different environments (each of which can contain different but identically named data sets) and controlling which one gets accessed via attach()ing and detach()ing the environments from the search path (what Glen_b's answer gets toward). You still don't have the disk backing of the data, but that is the previous problem.
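A small sketch of that idea, referring to each environment explicitly rather than via attach()/detach() (tmp and tmp1 are illustrative names, echoing the SAS libraries in the question):

```r
# two environments, each holding its own object named dat1
tmp  <- new.env()
tmp1 <- new.env()
tmp$dat1  <- data.frame(var1 = 1:3)
tmp1$dat1 <- data.frame(var1 = 4:6)

# the same name, disambiguated by the environment it lives in
sum(tmp$dat1$var1)          # 6
sum(tmp1$dat1$var1)         # 15

# with() evaluates an expression "inside" a chosen environment
with(tmp1, sum(dat1$var1))  # 15
```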

Finally, @joran's admonition is relevant. The solution to the problem of performing a set of tasks on potentially different (but related) sets of data in R is to write a function to do the work. The function has parameters. Within the function, the parameters are referred to by the names given in the argument list. When the function is called, which particular set of data is sent to it is specified by the function call; the names inside and outside the function need not have anything to do with each other. The suggestions about storing the multiple sets of data in a list are implicitly approaching the problem this way; the function is called for each set of data in the list in turn. Names don't matter, then.
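For instance, the repeated summary step from the question could be wrapped up once. Here summarise_by_var3 is a hypothetical helper using base R's aggregate (note that its formula method silently drops rows with NAs, unlike the na.rm = TRUE calls in the question's ddply code):

```r
# one reusable function: sum var1 and var2 within each level of var3
summarise_by_var3 <- function(dat) {
  aggregate(cbind(var1, var2) ~ var3, data = dat, FUN = sum)
}

# illustrative data; the same call works before and after any merge step
dat1 <- data.frame(var1 = c(1, 2, 3), var2 = c(10, 20, 30),
                   var3 = c("a", "a", "b"))
summarise_by_var3(dat1)
```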

4 votes

Here is an example using the SOAR package and named environments.

To quote from the vignette

Objects need not be always held in memory. The function save may be used to save objects on the disc in a file, typically with an .RData extension. The objects may then be removed from memory and later recalled explicitly with the load function.

The SOAR package provides a simple way to store objects on the disc, but in such a way that they remain visible on the search path as promises, that is, if and when an object is needed again it is automatically loaded into memory. It uses the same lazy loading mechanism as packages, but the functionality provided here is more dynamic and flexible.

It will be useful to read the whole vignette.

library(SOAR)
library(plyr)
library(sqldf)
set.seed(1)
# create some dummy data and a named environment to hold results
tmp <- new.env(parent = .GlobalEnv)
dat1 <- data.frame(var1 = rnorm(50), var2 = sample(50, replace = TRUE), var3 = sample(letters[1:5], 
    50, replace = TRUE))
tmp$dat1 <- ddply(dat1, .(var3), summarise, var1 = sum(var1, na.rm = TRUE), 
    var2 = sum(var2, na.rm = TRUE))
tmp$dat2 <- data.frame(Var3 = sample(letters[1:5], 20, replace = TRUE), Var4 = 1:20)
# store as a SOAR cached object (on disc)
Store(tmp, lib = "tmp")

# replace dat1 within the global environment using sqldf; first create a
# new environment to work in with the correct versions of dat1 and dat2
sqlenv <- tmp
sqlenv$dat1 <- dat1

dat1 <- sqldf("select a.*,b.* from dat1 a left join dat2 b on a.var3=b.var3", 
    envir = sqlenv)

# create a new named environment tmp1
tmp1 <- new.env(parent = .GlobalEnv)

tmp1$dat1 <- ddply(dat1, .(var3), summarise, var1 = sum(var1, na.rm = TRUE), 
    var2 = sum(var2, na.rm = TRUE))

# store using a SOAR cache
Store(tmp1, lib = "tmp")


tmp1$dat1

##   var3   var1 var2
## 1    a  1.336  378
## 2    b  8.514 1974
## 3    c  5.795  624
## 4    d -8.828  936
## 5    e 20.846 1490

tmp$dat1

##   var3    var1 var2
## 1    a  0.4454  126
## 2    b  1.4190  329
## 3    c  1.9316  208
## 4    d -2.9427  312
## 5    e  4.1691  298

I'm not sure you should expect tmp1$dat1 and tmp$dat1 to be identical (given my example, anyway).

2 votes

Named environments are one of a number of ways of achieving what it sounds like you want.

Personally, if there aren't a lot of different data frames or lists, I'd lean toward organizing things other ways, such as inside data frames or lists, depending on how your data is structured. But if each unit consists of many different kinds of data and functions, environments may be significantly better. They're described in the help, and a number of R blog posts discuss them.

But on reflection, RStudio projects may be closer to the way you're thinking about the problem (and if you're not using RStudio already, I highly recommend it). Have a look at how projects work.