We are running TensorFlow applications on a GPU from multiple Jupyter notebooks. Every once in a while one of the runs crashes the notebook, with nothing more than the notification "The kernel has crashed...".
When we moved the code into a plain Python .py file, stderr showed:

```
F tensorflow/core/kernels/conv_ops_3d.cc:369] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
Aborted
```
Another run reported on stderr:

```
F tensorflow/core/common_runtime/gpu/gpu_util.cc:296] GPU->CPU Memcpy failed
```
The problem is that the TensorFlow processes grab a lot of memory. On Linux you can run top to see what is going on: on our machine each TensorFlow process was holding 0.55 TB, presumably virtual address space (the VIRT column in top).
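Note that top only shows host-side memory; for per-process GPU memory there is nvidia-smi, which ships with the NVIDIA driver. A quick way to print it from Python (a sketch; it assumes nvidia-smi is on the PATH):

```python
import subprocess

# Print per-process GPU memory usage as CSV (pid, used_memory).
# nvidia-smi ships with the NVIDIA driver; assumes it is on the PATH.
print(subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"]
).decode())
```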
When you run the process inside a Jupyter notebook and do not shut the notebook down, the kernel holds on to its memory; by default TensorFlow pre-allocates nearly all of the GPU memory for the lifetime of the process. At some point a new run cannot get the memory it needs and dies, and inside a notebook the only message you get is that the kernel has died.
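As far as we can tell, TensorFlow can be told not to pre-allocate the whole GPU. A minimal sketch, assuming the TF 1.x Session API (which matches the error paths above); the 0.3 fraction is just an illustrative value, not something from our setup:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of
# grabbing (almost) all of it up front.
gpu_options = tf.GPUOptions(
    allow_growth=True,                    # grow allocations as needed
    per_process_gpu_memory_fraction=0.3,  # hard cap at 30% of the GPU
)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # Build and run the graph as usual.
    a = tf.constant([1.0, 2.0])
    print(sess.run(a * 2))
```

In TF 2.x the rough equivalent is tf.config.experimental.set_memory_growth(gpu, True) for each GPU device. Even with a cap, the memory is only released when the notebook kernel is shut down or restarted, so idle notebooks still have to be closed.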
Can anyone help with this?