1
votes

I have a data frame of 117,000 rows and 10,000 columns containing values, some repeated frequently and some not. I want to replace the values that are repeated fewer than 100 times within each column with zero. Here is a small example; my input looks like this:

1 200 444
2 310 000
3 310 000
4 444 444
5 200 444
6 200 112
7 310 444
8 310 876
9 310 876
10 444 876

In my output, any value that is repeated fewer than 3 times within a column should be set to zero (for example, in column 1 the value 444 is repeated fewer than 3 times, so it should be set to zero, and the same should be done for the values 000 and 112 in column 2):

1  200 444
2  310   0
3  310   0
4    0 444
5  200 444
6  200   0
7  310 444
8  310 876
9  310 876
10   0 876

Could anybody help me by writing a script to do this in R for a huge data frame (117,000 rows and 10,000 columns)?

2
Are these "numbers" numeric or character? The number 0 would not normally display as 000. – jlhoward
@jlhoward: that is right, you can consider them as characters. – zara
If that's the case, you cannot mix numeric and character in a single column. You could set them to NA though. – jlhoward

2 Answers

2
votes

Updated to incorporate @Arun's comments, and to demo with a dataset more representative of the real case.

This question is really interesting primarily because the full dataset is so large (117,000 rows × 10,000 columns). So the main issue is not how to flag the less common entries, but how to do that efficiently. Below are three options: a data.frame option, a naive data.table option (mine), and a more sophisticated data.table option (@Arun's). It's an object lesson in the superiority of data.table, especially when used correctly.

# create sample: 117,000 rows, 100 columns (1% of real case)
set.seed(1)    # for reproducibility
ltrs <- sapply(letters, function(x) paste(rep(x, 3), collapse = ""))
df   <- data.frame(replicate(100, sample(ltrs, 117e3, replace = TRUE,
                                         prob = c(0.5, 0.3, 0.1, 0.09, rep(0.01/22, 22)))),
                   stringsAsFactors = FALSE)

So this dataset has 117,000 rows and 100 columns (100% of the rows, 1% of the columns). Each entry is a three-letter string ("aaa", "bbb", etc.). The strings "aaa" through "ddd" account for 99% of the entries, and the other 22 letters collectively account for the remaining 1%, so there are bound to be values with frequency < 100.
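As a quick sanity check (not part of the original answer), you can tabulate one column and confirm that the rare strings really do fall well under the 100-count threshold:

# value counts for the first column, rarest first -- everything past
# "ddd" should appear far fewer than 100 times
sort(table(df[[1]]))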

# data.frame option
set.col <- function(x) {
  tbl <- table(x)                          # frequency of each value in the column
  x[x %in% names(tbl)[tbl < 100]] <- NA    # blank out values seen fewer than 100 times
  x
}
system.time(
  result.1 <- as.data.frame(sapply(df,set.col))
)
#    user  system elapsed 
#   44.52    0.27   44.95 

So the data.frame option runs in about 44 sec on my system. The real case, with 100 times as many columns, would run in about 4,400 sec, or roughly 73 min.

# naive data.table
library(data.table)
result.2 <- as.data.table(df)
system.time(
  for (j in 1:ncol(df)) {
    tbl <- table(result.2[[j]])
    set(result.2, i = which(result.2[[j]] %in% names(tbl)[tbl < 100]), j = j, value = NA)
  }
)
#    user  system elapsed 
#    1.51    0.05    1.56 

The naive data.table approach still uses table(...) but takes advantage of set(...) to update the columns by reference. It runs about 30× faster; the full dataset would run in about 3 min.

# sophisticated data.table
# install.packages("data.table", type="source")   # requires 1.9.6+
library(data.table)
result.3 <- as.data.table(df)
system.time(
  for (j in 1:ncol(df)) {
    tbl <- result.3[,.N,by=c(names(result.3)[j])][N<100]
    result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])]
  }
)
#    user  system elapsed 
#    0.65    0.00    0.67 
identical(result.2,result.3)
# [1] TRUE

The sophisticated data.table solution runs about 2.5× faster still, and about 70× faster than the data.frame option. The full dataset should run in about 65 sec.
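As an extra consistency check (not part of the original benchmark), the data.frame result should agree with the data.table results as well; a hedged sketch comparing the values column by column (coercing to character in case result.1 ended up with factor columns on an older R version):

# TRUE if every column of result.1 matches the corresponding column of result.2
all(mapply(function(a, b) identical(as.character(a), as.character(b)),
           result.1, as.data.frame(result.2)))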

The following just demonstrates that option 3 works with the example in the question:

# import sample data
df <- read.table(header=F, row.names=1, colClasses=c("character"),
                 text="1 200 444
                 2 310 000
                 3 310 000
                 4 444 444
                 5 200 444
                 6 200 112
                 7 310 444
                 8 310 876
                 9 310 876
                 10 444 876")
result.3 <- as.data.table(df)
for (j in 1:ncol(df)) {
  tbl <- result.3[,.N,by=c(names(result.3)[j])][N<3]
  result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])]
}

result.3
#      V2  V3
#  1: 200 444
#  2: 310  NA
#  3: 310  NA
#  4:  NA 444
#  5: 200 444
#  6: 200  NA
#  7: 310 444
#  8: 310 876
#  9: 310 876
# 10:  NA 876

To write the result to a csv file, use something like:

write.csv(result.3, file="myfile.csv")
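The question asks for zeros rather than NA; if you do need literal zeros in these character columns, one option (my suggestion, not part of the original answer, with "0" as an assumed placeholder string) is to replace the NAs in place with set() before writing:

# replace NA with the string "0", column by column, in place
for (j in seq_along(result.3)) {
  set(result.3, i = which(is.na(result.3[[j]])), j = j, value = "0")
}
# then write to csv as above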
0
votes

OK, here is a solution to a problem similar to the one you laid out in your simplified version: the counts here are taken over the whole data frame rather than per column. The comments should explain it as it goes along.

Here we replace all elements which appear strictly fewer than four times, so nearly the whole second column of test_df should go to zero (the leading 3 survives because it also appears in the first column, and the counts are taken over the whole data frame).

# Creates fake test dataframe 
col1 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
col2 <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
test_df <- data.frame(col1, col2)

# Finds the number of occurrences of every element in the dataframe
occurrences <- table(unlist(test_df))

# Find the unique elements across the whole dataframe 
elements <- unique(unlist(test_df))

# Creates an empty vector for all elements less than four 
elements_less_than_four <- c()

# Loops through all elements in the dataframe and if they appear less than
# four times puts them in a list 
for(el in elements){

  # table() names are character, so look the element up by name
  # rather than by position
  if( occurrences[[as.character(el)]] < 4){
    elements_less_than_four <- c(elements_less_than_four, el)
  }

}

# Unlist the df for quick comparison, turn all necessary values to zero.
unlisted_df <- as.vector(unlist(test_df))
correct_values <- replace(unlisted_df, unlisted_df %in% elements_less_than_four, 0)

# Reformats the dataframe
finished_df <- data.frame(matrix(correct_values, nrow = nrow(test_df), 
                          ncol = ncol(test_df), byrow = FALSE))
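A quick way to verify the result (my addition): print finished_df and re-tabulate its values; everything that appeared fewer than four times across the whole data frame should now show up as 0.

# inspect the cleaned data frame and the new value counts
finished_df
table(unlist(finished_df))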