44
votes

To provide a reproducible example of an approach, a data set must often be provided. Instead of building an example data set, I wish to use some of my own data. However, this data cannot be released. I wish to replace variable (column) names and factor levels with uninformative placeholders (e.g. V1...V5, L1...L5).

Is an automated way to do this available?

Ideally, this would be done in R, taking in a data.frame and producing this anonymous data.frame.

With such a data set, simply search and replace variable names in your script and you have a publicly releasable reproducible example.

Such a process may increase the inclusion of appropriate data in reproducible examples and even the inclusion of reproducible examples in questions, comments and bug reports.

3
I'd suggest it may also be important to anonymize the data itself, perhaps by rescaling by (x-mean)/sd or to a unif(0,1), depending on the data set. One would have to keep in mind the purpose of the data set, as either of these specific suggestions could hide important features. – Aaron left Stack Overflow
Rescaling should work. Maybe just normalization. I still need the structure to be present. – Etienne Low-Décarie
I added a solution that avoids loops and tags levels with variable names. – Etienne Low-Décarie
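
The rescaling suggested in these comments could be sketched as below. This is only an illustration: the helper name `rescale_numeric` and the toy data are hypothetical, and reading "unif(0,1)" as min-max scaling onto [0, 1] is an assumption about what the commenter meant.

```r
## Hypothetical helper (not from the thread): rescale every numeric column
rescale_numeric <- function(df, method = c("zscore", "range")) {
    method <- match.arg(method)
    is_num <- vapply(df, is.numeric, logical(1))
    df[is_num] <- lapply(df[is_num], function(x) {
        if (method == "zscore") {
            ## (x - mean) / sd, as suggested in the comment
            (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
        } else {
            ## assumption: min-max scaling onto [0, 1]
            (x - min(x, na.rm = TRUE)) /
                (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
        }
    })
    df
}

## Toy data, made up for this sketch
toy <- data.frame(score = c(10, 20, 30, 40), group = c("a", "b", "a", "b"))
z <- rescale_numeric(toy, "zscore")   # numeric column centred and scaled
u <- rescale_numeric(toy, "range")    # numeric column mapped onto [0, 1]
```

Non-numeric columns pass through untouched, so either variant can be combined with a separate relabelling of names and levels.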

3 Answers

39
votes

I don't know whether there was a function to automate this, but now there is ;)

## A function to anonymise columns in 'colIDs' 
##    colIDs can be either column names or integer indices
anonymiseColumns <- function(df, colIDs) {
    ## Resolve column names to integer positions
    ids <- if(is.character(colIDs)) match(colIDs, names(df)) else colIDs
    for(id in ids) {
        prefix <- sample(LETTERS, 1)
        suffix <- as.character(as.numeric(as.factor(df[[id]])))
        df[[id]] <- paste(prefix, suffix, sep="")
    }
    ## Rename every anonymised column, not just the last one
    names(df)[ids] <- paste("V", ids, sep="")
    df
}

## A data.frame containing sensitive information
df <- data.frame(
    name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each=2),
    hiscore = runif(10, 99, 100),
    passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse="")))

## Anonymise it
df2 <- anonymiseColumns(df, c(1,3))

## Check that it worked
> head(df, 3)
           name  hiscore    passwd
1 Douglas Bates 99.96714 ROELIAncz
2 Douglas Bates 99.07243 gDOLNMyVe
3 John Chambers 99.55322 xIVPHDuEW

> head(df2, 3)
  V1  hiscore V3
1 Q1 99.96714 V8
2 Q1 99.07243 V2
3 Q2 99.55322 V9
16
votes

Here is my version of the function. Advantages: no for loops, level labels match variable labels, it can be applied to any data.frame, variable names stay ordered beyond 26 columns, and numeric variables are rescaled (divided by their mean).

Thanks go to:
@Tyler Rinker for a solution to using column names in apply functions &
@Josh O'Brien for his response to this question

It is available here as a gist.

The data from @Josh O'Brien, with a non-factor variable:

df <- data.frame(
  name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each=2),
  hiscore = runif(10, 99, 100),
  passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse="")))

df$passwd<-as.character(df$passwd)

The function

anonym <- function(df) {
  ## Column labels A..Z, extended with AA, AB, ..., ZZ for wide data frames
  labels <- LETTERS
  if (length(df) > 26) {
    labels <- c(LETTERS, paste(rep(LETTERS, each = 26), LETTERS, sep = ""))
  }
  names(df) <- labels[seq_along(df)]

  ## Anonymise column i: tag levels with the column label, rescale numerics
  level.id <- function(i) {
    if (is.factor(df[[i]]) || is.character(df[[i]])) {
      paste(names(df)[i], as.numeric(as.factor(df[[i]])), sep = ".")
    } else if (is.numeric(df[[i]])) {
      df[[i]] / mean(df[[i]], na.rm = TRUE)
    } else {
      df[[i]]
    }
  }
  ## sapply() returns a character matrix when any column is text,
  ## so numeric columns come back as character strings
  DF <- data.frame(sapply(seq_along(df), level.id))
  names(DF) <- names(df)
  DF
}

anonym(df)

The results:

    A                 B    C
1  A.1  1.00492190370171  C.8
2  A.1 0.997214883153158  C.2
3  A.2  1.00532434407094  C.5
4  A.2  1.00015775550051  C.6
5  A.3 0.998947207241593  C.3
6  A.3 0.998083738806433  C.4
7  A.5  1.00322085765279  C.7
8  A.5 0.995853096468764  C.1
9  A.4 0.998662338687036 C.10
10 A.4  0.99761387471706  C.9
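
Note that column B in these results is printed with full digits because `sapply()` collapses everything into a character matrix. If the numeric type needs to survive, one possible variant swaps in `lapply()`. This is a minimal sketch, not the gist's code: the name `anonym_typed` and the toy data are hypothetical, and it only covers up to 26 columns.

```r
## Hypothetical type-preserving variant (assumption: at most 26 columns)
anonym_typed <- function(df) {
    stopifnot(length(df) <= 26)
    names(df) <- LETTERS[seq_along(df)]
    ## df[] <- keeps the data.frame structure and assigns column by column
    df[] <- lapply(seq_along(df), function(i) {
        x <- df[[i]]
        if (is.factor(x) || is.character(x)) {
            ## tag each level with the anonymous column label
            paste(names(df)[i], as.numeric(as.factor(x)), sep = ".")
        } else if (is.numeric(x)) {
            ## same rescaling as above: divide by the column mean
            x / mean(x, na.rm = TRUE)
        } else {
            x
        }
    })
    df
}

## Toy data, made up for this sketch
toy <- data.frame(who = c("ann", "bob", "ann"), score = c(2, 4, 6))
out <- anonym_typed(toy)
str(out)   # column B stays numeric
```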
14
votes

If all you want to do is replace the column names with anonymous labels, and likewise the levels of factors, then yes. First, some dummy data to use as the example:

dat <- data.frame(top_secret1 = rnorm(10), top_secret2 = runif(10),
                  top_secret3 = factor(sample(3, 10, replace = TRUE),
                                       labels = paste("Person", 1:3, sep = "")))

To replace the column names do:

dat2 <- dat
colnames(dat2) <- paste("Variable", seq_len(ncol(dat2)), sep = "")

Which gives

> head(dat2)
   Variable1 Variable2 Variable3
1 -0.4858656 0.4846700   Person3
2  0.2660125 0.1727989   Person1
3  0.1595297 0.6413984   Person2
4  1.1952239 0.1892749   Person3
5  0.3914285 0.6235119   Person2
6  0.3752015 0.7057372   Person3

Next, change the levels:

foo <- function(x) {
    if(is.factor(x)) {
        levels(x) <- sample(LETTERS, length(levels(x)))
    }
    x
}
dat3 <- data.frame(lapply(dat2, foo))

which gives

> head(dat3)
   Variable1 Variable2 Variable3
1 -0.4858656 0.4846700         K
2  0.2660125 0.1727989         G
3  0.1595297 0.6413984         O
4  1.1952239 0.1892749         K
5  0.3914285 0.6235119         O
6  0.3752015 0.7057372         K

foo() is just a simple function that takes a vector and checks whether it is a factor. If it is, the levels are replaced with a vector of random letters of the appropriate length. The vector is then returned, so non-factor columns pass through unchanged.
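
One limitation worth knowing: `sample(LETTERS, n)` fails when a factor has more than 26 levels, and character columns are skipped entirely. A possible variant, sketched under the assumption that numbered labels such as L1, L2, ... are acceptable (the name `relabel_levels` and the `prefix` argument are hypothetical):

```r
## Hypothetical variant of foo(): labels are "L1", "L2", ... in random order
relabel_levels <- function(x, prefix = "L") {
    ## assumption: character vectors should be treated as categorical
    if (is.character(x)) x <- factor(x)
    if (is.factor(x)) {
        n <- nlevels(x)
        ## a random permutation of 1..n, so labels do not follow level order
        levels(x) <- paste(prefix, sample.int(n), sep = "")
    }
    x
}

set.seed(1)
## A factor with 30 levels, more than LETTERS can cover
many <- factor(sprintf("site%02d", rep(1:30, each = 2)))
anon <- relabel_levels(many)
nlevels(anon)
```

Because only the labels change, the grouping of observations is the same before and after relabelling.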

We can wrap this into a function to do all the changes requested

anonymise <- function(df, colString = "Variable", rowString = "Sample") {
    foo <- function(x) {
        if(is.factor(x)) {
            levels(x) <- sample(LETTERS, length(levels(x)))
        }
        x
    }
    ## replace the variable names
    colnames(df) <- paste(colString, seq_len(ncol(df)), sep = "")
    ## fudge any factor levels
    df <- data.frame(lapply(df, foo))
    ## replace rownames
    rownames(df) <- paste(rowString, seq_len(nrow(df)), sep = "")
    ## return
    df
}

In use this gives

> anonymise(dat)
           Variable1 Variable2 Variable3
Sample1  -0.48586557 0.4846700         F
Sample2   0.26601253 0.1727989         L
Sample3   0.15952973 0.6413984         N
Sample4   1.19522395 0.1892749         F
Sample5   0.39142851 0.6235119         N
Sample6   0.37520154 0.7057372         F
Sample7   1.18440762 0.7355211         F
Sample8   0.03605239 0.3924925         L
Sample9  -0.64078219 0.4579347         N
Sample10 -1.39680109 0.9047227         L
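
Since the question mentions search-and-replacing variable names in a script, it can help to keep a private lookup table of original and anonymous names. A minimal sketch, assuming such a key is wanted; `anonymise_names` and the toy data are hypothetical and not part of the function above:

```r
## Hypothetical helper: anonymise column names and keep a lookup table
anonymise_names <- function(df, colString = "Variable") {
    key <- data.frame(original  = colnames(df),
                      anonymous = paste(colString, seq_len(ncol(df)), sep = ""),
                      stringsAsFactors = FALSE)
    colnames(df) <- key$anonymous
    list(data = df, key = key)
}

## Toy data, made up for this sketch
secret <- data.frame(income = c(1, 2), clinic = c("x", "y"))
res <- anonymise_names(secret)
res$data   # safe to share
res$key    # keep private: it maps anonymous names back to the real ones

## Translate a name from the public example back to the original
res$key$original[match("Variable2", res$key$anonymous)]
```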