My code builds the graph and runs it successfully in CPU mode on Azure ML, but in GPU mode it raises a ResourceExhaustedError during setup (at sess.run(init), per the traceback below).

I switch between CPU and GPU modes simply by removing the tf.device call:

with tf.device('/cpu:0'), tf.name_scope('embedding'): #cpu mode runs fine

with tf.name_scope('embedding'): #gpu mode throws the exception

I tried loading less data, but that didn't work either.

I suspect I missed a step when setting up the GPU. Any ideas?

Azure error msg:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300] [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

Complete error msg:

Traceback (most recent call last):
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NN.py", line 130, in <module>
    sess.run(init)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

Caused by op 'embedding_matrix/Assign', defined at:
  File "NN.py", line 120, in <module>
    , trainable=False)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 805, in _get_single_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 213, in __init__
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 346, in _init_from_args
    validate_shape=validate_shape).op
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
    validate_shape=validate_shape)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
    use_locking=use_locking, name=name)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[78298,300] [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

1 Answer

Host memory is quite a bit larger than device memory on an N-series machine. Are you sure you aren't simply exceeding the device capacity?
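
To put a number on that question, here is a rough estimate of how much memory the tensor named in the error would need. It is only a sketch: the shape comes straight from the error message, and the 4 bytes per element is an assumption based on T=DT_FLOAT being 32-bit floats. It does not account for any extra copies TensorFlow may hold (for example the Const initializer) or for anything else already on the device.

```python
# Shape taken from the error message: shape[78298,300], T=DT_FLOAT
rows, cols = 78298, 300
BYTES_PER_FLOAT32 = 4  # assumption: DT_FLOAT is 32-bit floating point

num_elements = rows * cols
num_bytes = num_elements * BYTES_PER_FLOAT32
mib = num_bytes / 2**20  # bytes -> mebibytes

print("elements:", num_elements)
print("bytes:   ", num_bytes)
print("MiB:      {:.1f}".format(mib))
```

That works out to roughly 90 MiB for this one tensor. Comparing it against the free device memory (for example, what nvidia-smi reports on the VM just before the script starts) should show whether the GPU is already nearly full when this allocation is attempted.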