
While training my model on data larger than 20 GB using the BASIC tier in Cloud ML Engine, my jobs are failing because there is no disk space left on the Cloud ML machines, and I cannot find any details about disk space in the machine-types documentation [https://cloud.google.com/ml-engine/docs/tensorflow/machine-types].

I need help choosing the tier for my training jobs; also, the utilisation shown in the Job Details graphs is very low.
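For reference, the machine-types page linked above lets a job move beyond BASIC by passing a config file to gcloud ml-engine jobs submit training via --config. A minimal sketch, assuming the legacy scale-tier and machine-type names from that page (large_model is only an example choice, not a recommendation):

# config.yaml, passed as: gcloud ml-engine jobs submit training <job_name> --config config.yaml ...
trainingInput:
  scaleTier: CUSTOM          # or a predefined tier such as STANDARD_1 / PREMIUM_1
  masterType: large_model    # with CUSTOM, pick a master machine type from the machine-types page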

{
  insertId: "1klpt2"
  jsonPayload: {
    created: 1554434546.3576794
    levelname: "ERROR"
    lineno: 51
    message: "Failed to train : [Errno 28] No space left on device"
    pathname: "/root/.local/lib/python3.5/site-packages/loggerwrapper.py"
  }
  labels: {
    compute.googleapis.com/resource_id: ""
    compute.googleapis.com/resource_name: "cmle-training-10361805218452604847"
    compute.googleapis.com/zone: ""
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/trial_id: ""
  }
  logName: "projects/backend/logs/master-replica-0"
  receiveTimestamp: "2019-03-31T12:32:30.07683Z"
  resource: {
    labels: {
      job_id: ""
      project_id: "backend"
      task_name: "master-replica-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2019-03-31T12:32:26.357679367Z"
}
All the machines come with ~100 GB of disk. Can you try to delete cache or old files, please? – Guoqing Xu

1 Answer


Solved: this error was not caused by a lack of storage space but by the shared-memory tmpfs. The sklearn fit was consuming all of the shared memory during training. Solution: setting the JOBLIB_TEMP_FOLDER environment variable to /tmp solved the problem.
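A minimal sketch of that fix, assuming the training code is Python and uses a scikit-learn estimator with joblib-backed parallelism (the RandomForestClassifier below is only illustrative; the original post does not show the model). The variable has to be set before the fit starts:

import os

# Point joblib's temporary / memmapping folder at the instance's disk (/tmp)
# instead of the small shared-memory tmpfs (/dev/shm).
os.environ["JOBLIB_TEMP_FOLDER"] = "/tmp"

from sklearn.ensemble import RandomForestClassifier  # illustrative estimator only

model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
# model.fit(X_train, y_train)  # the fit now spills joblib temp files to /tmp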