5
votes

I have a CSV file with 600,000 rows and 1,339 columns, making up about 1.6 GB on disk. 1,337 of the columns are binary, taking values of either 1 or 0, and the other 2 columns are numeric and character variables.

I pulled in the data using the readr package with the following code:

library(readr)

VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")
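One way to keep the object small at read time is to declare the column types explicitly, so the 0/1 columns are stored as 4-byte integers rather than possibly ending up as 8-byte doubles. A minimal sketch, assuming Replicate is the character column (adjust the spec to the real column types):

# Declare types up front: 0/1 columns as 4-byte integers, Replicate as character
# (the column names other than Replicate are assumed from the code further down)
VLU_All_Before_Wide <- read_csv(
  "C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv",
  col_types = cols(.default = col_integer(), Replicate = col_character())
)

Alternatively, data.table::fread with its select argument (as suggested in the comments below) lets you drop unneeded columns at read time.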

When I checked the object size using the following code, it was about 3 GB:

> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb
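For scale, a rough back-of-the-envelope estimate of the in-memory footprint, assuming the 0/1 columns are stored as 4-byte integers:

# ~600,000 rows x 1,339 columns at 4 bytes per cell
600000 * 1339 * 4 / 1024^3   # about 3 GiB
600000 * 1339 * 8 / 1024^3   # about 6 GiB if the same columns end up as doubles

so the reported 3.2 Gb is roughly what integer storage of a table this size costs before any copies are made.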

In the next step, using the code below, I want to create training and test sets for LASSO regression:

set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), 0.7 * nrow(VLU_All_Before_Wide))

# 70/30 split into training and test sets
train_set <- VLU_All_Before_Wide[train_rows, ]
test_set <- VLU_All_Before_Wide[-train_rows, ]

# Outcome (VLU_Incidence) and predictor matrices for the LASSO fit
yall_tra <- data.matrix(subset(train_set, select = VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select = -c(VLU_Incidence, Replicate)))
yall_tes <- data.matrix(subset(test_set, select = VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select = -c(VLU_Incidence, Replicate)))
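Because the predictors are almost all 0/1, and if most of those entries are 0, a sparse representation can shrink xall_tra and xall_tes considerably; glmnet accepts sparse matrices directly. A minimal sketch with the Matrix package, assuming the LASSO is fit with glmnet:

library(Matrix)

# Convert the dense 0/1 predictor matrices to sparse dgCMatrix form;
# glmnet() accepts these in place of dense matrices
xall_tra <- Matrix(xall_tra, sparse = TRUE)
xall_tes <- Matrix(xall_tes, sparse = TRUE)
gc()   # reclaim the memory held by the dense copies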

When I started my R session, RAM usage was at ~3 GB, and by the time I had executed all of the above code it was at 14 GB, leaving me with an error saying R cannot allocate a vector of size 4 GB. There were no other applications running apart from 3 Chrome windows. I removed the original dataset and the training and test datasets, but that only freed about 0.7 to 1 GB of RAM.

rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)
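As far as I understand, rm() only removes the bindings; the memory may not show up as freed until the garbage collector runs, so it can help to call gc() explicitly afterwards:

gc()   # run garbage collection so R actually releases the memory freed by rm()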

I'd appreciate it if someone could guide me to a way to reduce the size of the data.

Thanks

1
Do you need all of the columns? You can use fread in the data.table package to select the columns you want to import. – MatthewR
Actually, I have ~2,000 variables. I excluded columns with a count of less than 15, since statisticians use a minimum count of 15 as a rule of thumb for any statistical analysis. I'm not sure yet whether all of the other variables are predictive. – petax004

1 Answer

3
votes

R struggles with huge datasets because it tries to load and keep all of the data in RAM. You can use other R packages that are designed to handle big datasets, such as bigmemory and ff. Check my answer here, which addresses a similar issue.

You can also choose to do some of the data processing and manipulation outside R and remove unnecessary columns and rows. But still, to handle big datasets it's better to use packages built for the job, as in the sketch below.
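For example, bigmemory can hold the table in a file-backed matrix that is memory-mapped rather than kept entirely in RAM. A minimal sketch, assuming the character column is dropped or recoded to numeric first (the biglasso package can then fit a LASSO on the resulting big.matrix):

library(bigmemory)

# Read the CSV into a file-backed big.matrix: the data lives on disk and is
# memory-mapped, so it does not all have to fit in RAM at once
X <- read.big.matrix("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv",
                     header = TRUE, type = "integer",
                     backingfile = "vlu.bin", descriptorfile = "vlu.desc")

# Later sessions can re-attach the same backing file without re-reading the CSV:
# X <- attach.big.matrix("vlu.desc")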