
While training my model on data larger than 20 GB using the BASIC tier in Cloud ML Engine, my jobs are failing because there is no disk space left on the Cloud ML machines, and I cannot find any details about disk space in the machine-types documentation [https://cloud.google.com/ml-engine/docs/tensorflow/machine-types].

I need help choosing the tier for my training jobs; also, the utilisation shown in the Job Details graphs is very low.
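For reference, the machine-types page linked above lets a job move beyond BASIC by passing a config file to gcloud ml-engine jobs submit training via --config. A minimal sketch, assuming the legacy scale-tier and machine-type names from that page (large_model is only an example choice, not a recommendation):

# config.yaml, passed as: gcloud ml-engine jobs submit training <job_name> --config config.yaml ...
trainingInput:
  scaleTier: CUSTOM          # or a predefined tier such as STANDARD_1 / PREMIUM_1
  masterType: large_model    # with CUSTOM, pick a master machine type from the machine-types page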

{
  insertId: "1klpt2"
  jsonPayload: {
    created: 1554434546.3576794
    levelname: "ERROR"
    lineno: 51
    message: "Failed to train : [Errno 28] No space left on device"
    pathname: "/root/.local/lib/python3.5/site-packages/loggerwrapper.py"
  }
  labels: {
    compute.googleapis.com/resource_id: ""
    compute.googleapis.com/resource_name: "cmle-training-10361805218452604847"
    compute.googleapis.com/zone: ""
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/trial_id: ""
  }
  logName: "projects/backend/logs/master-replica-0"
  receiveTimestamp: "2019-03-31T12:32:30.07683Z"
  resource: {
    labels: {
      job_id: ""
      project_id: "backend"
      task_name: "master-replica-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2019-03-31T12:32:26.357679367Z"
}
All the machines come with ~100 GB of disk. Can you try to delete cache or old files, please? – Guoqing Xu

1 Answer


Solved: this error was not caused by a lack of storage space but by the shared-memory tmpfs. The sklearn fit was consuming all of the shared memory during training. Solution: setting the JOBLIB_TEMP_FOLDER environment variable to /tmp solved the problem.
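A minimal sketch of that fix, assuming the training code is Python and uses a scikit-learn estimator with joblib-backed parallelism (the RandomForestClassifier below is only illustrative; the original post does not show the model). The variable has to be set before the fit starts:

import os

# Point joblib's temporary / memmapping folder at the instance's disk (/tmp)
# instead of the small shared-memory tmpfs (/dev/shm).
os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp"

from sklearn.ensemble import RandomForestClassifier  # illustrative estimator only

model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
# model.fit(X_train, y_train)  # the fit now spills joblib temp files to /tmp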