I'm trying to write a program that takes a large data frame and replaces every value in each column with the cumulative frequency of that value within its column (values sorted ascending). For instance, if a column contains the values 5, 8, 3, 5, 4, 3, 8, 5, 5, 1, then the relative and cumulative frequencies are:
- 1: rel_freq=0.1, cum_freq = 0.1
- 3: rel_freq=0.2, cum_freq = 0.3
- 4: rel_freq=0.1, cum_freq = 0.4
- 5: rel_freq=0.4, cum_freq = 0.8
- 8: rel_freq=0.2, cum_freq = 1.0
The original column then becomes: 0.8, 1.0, 0.3, 0.8, 0.4, 0.3, 1.0, 0.8, 0.8, 0.1
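For reference, those numbers can be reproduced in base R with `table()` and `cumsum()`; this is only a check of the worked example above, using the sample column as a vector:

```r
x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)

tab <- table(x)                                    # counts per distinct value, sorted ascending
cum_freq <- cumsum(as.numeric(tab)) / length(x)    # 0.1 0.3 0.4 0.8 1.0
names(cum_freq) <- names(tab)                      # keyed by "1", "3", "4", "5", "8"

cum_freq[as.character(x)]                          # 0.8 1.0 0.3 0.8 0.4 0.3 1.0 0.8 0.8 0.1
```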
The following code performs this operation correctly, but it scales poorly, probably because of the nested loops. Any idea how to perform this task more efficiently?
mydata = read.table(.....)
totalcols = ncol(mydata)
totalrows = nrow(mydata)

for (i in 1:totalcols) {
  # build the relative frequency table for column i
  freqtable = data.frame(table(mydata[, i]) / totalrows)
  # add the cumulative frequency
  freqtable$CumSum = cumsum(freqtable$Freq)

  # store each value's cumulative frequency in a hash (environment)
  hashtable = new.env(hash = TRUE)
  nrows = nrow(freqtable)
  for (x in 1:nrows) {
    dummy = toString(freqtable$Var1[x])
    hashtable[[dummy]] = freqtable$CumSum[x]
  }

  # replace the original data with the cumulative frequencies, cell by cell
  for (j in 1:totalrows) {
    dummy = toString(mydata[j, i])
    mydata[j, i] = hashtable[[dummy]]
  }
}
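One relationship that might be relevant: the cumulative frequency of an entry equals the empirical CDF of its column evaluated at that entry, so I imagine something vectorized along these lines could replace the hash and row loops, though I'm not sure it's the right approach. A rough sketch, assuming base R and that every column of mydata is numeric:

```r
# Sketch only: replace each column by its empirical CDF evaluated at each entry,
# which equals the cumulative relative frequency of that entry within the column.
mydata[] <- lapply(mydata, function(col) ecdf(col)(col))
```

On the example column this yields 0.8, 1.0, 0.3, 0.8, 0.4, 0.3, 1.0, 0.8, 0.8, 0.1, matching the table above.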