How to reduce the size of the data in R?

Question

I've a CSV file which has 600,000 rows and 1339 columns making 1.6 GB. 1337 columns are binaries taking either 1 or 0 values and other 2 columns are numeric and character variables.

I pulled the data use the package readr with following code

VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")

When I checked the object size using following code, it's about 3 gb.

> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb

In the next step, using the below code, I want to create training and test set for LASSO regression.

set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), .7*nrow(VLU_All_Before_Wide))

train_set <- VLU_All_Before_Wide[train_rows,]
test_set <- VLU_All_Before_Wide[-train_rows,]

yall_tra <- data.matrix(subset(train_set, select=VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select=-c(VLU_Incidence,Replicate)))
yall_tes <- data.matrix(subset(test_set, select=VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select=-c(VLU_Incidence,Replicate)))

When I started my R session the RAM was at ~3 gb and by the time I exicuted all the above code it's now at 14 gb, leaving me an error saying can't allocate vector of size 4 gb. There was no other application running other than 3 chrome windows. I removed the original dataset, training and test dataset but it only reduced .7 to 1 gb RAM.

rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)

Appreciate if someone can guide me a way to reduce the size of the data.

Thanks

Do you need all of the columns? You can use fread in the data.table package to select the columns you want to import — MatthewR
Actually, I've ~2,000 variables, I excluded columns with a less than 15 count as statisticians use minimum 15 count as a rule of thumb for any statistical analysis. I'm not sure if all other variables are predictive or not yet. — petax004

snair.stack snair.stack · Accepted Answer · 2019-03-21T05:57:13

R struggles when it comes to huge datasets because it tries to load and keep all the data into the RAM. You can use other packages available in R which are made to handle big datasets, like 'bigmemory and ff. Check my answer here which addresses a similar issue.

You can also choose to do some data processing & manipulation outside R and remove unnecessary columns and rows. But still, to handle big datasets, it's better to use the capable packages.

How to reduce the size of the data in R?

1 Answers