Updated to incorporate @Arun's comments, and to demo with a dataset more representative of the real case.
This question is really interesting primarily because the full dataset is so large (117,000 rows × 10,000 columns). So the main issue is not how to flag the less common entries, but how to do that efficiently. This shows three options: a data.frame option, a naive data.table option (mine), and a sophisticated data.table option (@Arun's). It's an object lesson in the superiority of data.table, especially when used correctly.
# create sample: 117,000 rows, 100 columns (1% of real case)
set.seed(1) # for reproducibility
ltrs <- sapply(letters, function(x) paste(rep(x, 3), collapse = ""))
df <- data.frame(replicate(100, sample(ltrs, 117e3, replace = TRUE,
                                       prob = c(0.5, 0.3, 0.1, 0.09, rep(0.01/22, 22)))),
                 stringsAsFactors = FALSE)
So this dataset has 117,000 rows and 100 columns (100% of the rows, 1% of the columns). Each entry is a three-letter string ("aaa", "bbb", etc.). "aaa" through "ddd" account for 99% of the entries and the other 22 strings collectively account for the remaining 1%, so there are bound to be values with frequency < 100.
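As a quick sanity check (not part of any benchmark), you can look at the least common values in one column; with these sampling probabilities each rare string is expected to appear about 117,000 × 0.01/22 ≈ 53 times, well under the threshold of 100:

# five least common values in the first column (exact counts vary with the seed)
head(sort(table(df[[1]])), 5)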
# data.frame option
set.col <- function(x) {
  tbl <- table(x)                          # count occurrences of each value
  x[x %in% names(tbl)[tbl < 100]] <- NA    # blank the values occurring < 100 times
  x
}
system.time(
  result.1 <- as.data.frame(sapply(df, set.col))
)
# user system elapsed
# 44.52 0.27 44.95
So the data.frame option runs in ~44 sec (on my system). The real case would run in ~4400 sec, or about 73 min.
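In case the flagging logic inside set.col() isn't obvious, here is what the table() / %in% combination does on a small vector (a throwaway illustration, using a threshold of 3 instead of 100):

x   <- c("aaa", "aaa", "aaa", "zzz", "qqq", "aaa")
tbl <- table(x)
names(tbl)[tbl < 3]                  # values occurring fewer than 3 times
# [1] "qqq" "zzz"
x[x %in% names(tbl)[tbl < 3]] <- NA  # blank them out
x
# [1] "aaa" "aaa" "aaa" NA    NA    "aaa"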
# naive data.table
library(data.table)
result.2 <- as.data.table(df)
system.time(
  for (j in 1:ncol(df)) {
    tbl <- table(result.2[[j]])
    # replace the rare values in column j by reference
    set(result.2, i = which(result.2[[j]] %in% names(tbl)[tbl < 100]), j = j, NA)
  }
)
# user system elapsed
# 1.51 0.05 1.56
The naive data.table approach still uses table(...) but takes advantage of set(...). It runs about 30X faster. The full dataset would run in about 3 min.
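The speedup comes from set(), which updates a data.table column in place (by reference) rather than copying it the way subset-assignment on a data.frame does. A minimal sketch of the semantics, on a throwaway table rather than the benchmark data:

dt <- data.table(x = c("aaa", "zzz", "aaa"))
set(dt, i = which(dt$x == "zzz"), j = "x", NA)   # modifies dt in place, no copy
dt$x
# [1] "aaa" NA    "aaa"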
# sophisticated data.table
# install.packages("data.table", type="source") # requires 1.9.6+
library(data.table)
result.3 <- as.data.table(df)
system.time(
  for (j in 1:ncol(df)) {
    # count occurrences of each value in column j; keep those with N < 100
    tbl <- result.3[, .N, by = c(names(result.3)[j])][N < 100]
    # join back on column j and set the matching rows to NA
    result.3[tbl, c(names(result.3)[j]) := NA, on = c(names(result.3)[j])]
  }
)
# user system elapsed
# 0.65 0.00 0.67
identical(result.2,result.3)
# [1] TRUE
The sophisticated data.table solution runs 2.5X faster still, and about 70X faster than the data.frame option. The full dataset should run in about 65 sec.
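For anyone unfamiliar with the idiom: the [, .N, by = ...] call counts how often each value occurs in that column, and the join-update with on= then blanks the rows whose value appears in the "rare" table. A toy version of the same pattern (assumes data.table 1.9.6+ for on=):

dt  <- data.table(x = c("aaa", "bbb", "aaa", "ccc"))
low <- dt[, .N, by = "x"][N < 2]   # values occurring fewer than twice
dt[low, x := NA, on = "x"]         # join on x and set the matching rows to NA
dt$x
# [1] "aaa" NA    "aaa" NA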
The following just demonstrates that option 3 works with the example in the question:
# import sample data
df <- read.table(header = FALSE, row.names = 1, colClasses = c("character"),
                 text = "1 200 444
                         2 310 000
                         3 310 000
                         4 444 444
                         5 200 444
                         6 200 112
                         7 310 444
                         8 310 876
                         9 310 876
                        10 444 876")
result.3 <- as.data.table(df)
for (j in 1:ncol(df)) {
  # same loop as above, but with a threshold of 3 for this small example
  tbl <- result.3[, .N, by = c(names(result.3)[j])][N < 3]
  result.3[tbl, c(names(result.3)[j]) := NA, on = c(names(result.3)[j])]
}
result.3
# V2 V3
# 1: 200 444
# 2: 310 NA
# 3: 310 NA
# 4: NA 444
# 5: 200 444
# 6: 200 NA
# 7: 310 444
# 8: 310 876
# 9: 310 876
# 10: NA 876
To write the result to a csv file, use something like:
write.csv(result.3, file="myfile.csv")
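For a result the size of the real case, data.table's own fwrite() (available in newer versions of the package) is usually much faster than write.csv:

fwrite(result.3, file = "myfile.csv")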