CUDA error after updating AI Notebook/VM

Hey IDC users!

Yesterday I updated one of the AI Notebooks I’m using, and to my great surprise, when I tried to run an inference pipeline that normally runs on the GPU, it fell back to the CPU and printed this warning:

/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
(Triggered internally at  /opt/conda/conda-bld/pytorch_1614378098133/work/c10/cuda/CUDAFunctions.cpp:109.)

The same error can be reproduced by running the following in a Python shell:

>>> import torch
>>> torch.cuda.is_available()

This looked odd from the start, since the GPU was correctly detected by nvidia-smi.
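
For reference, here is a slightly fuller check of what PyTorch sees, just a sketch (the exact values will differ per machine and PyTorch build):

import torch

# False when CUDA initialization fails, as in the warning above
print(torch.cuda.is_available())

# Number of GPUs PyTorch can use; 0 while the error persists
print(torch.cuda.device_count())

# CUDA version PyTorch was built against (independent of the installed driver)
print(torch.version.cuda)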

For the sake of completeness, here are the specs of the VM:

  • Environment: PyTorch:1.7
  • Machine Type: 8 vCPUs, 52 GB RAM
  • GPUs: NVIDIA Tesla T4 x 1

Restarting the instance didn’t fix the problem, so I started digging a bit online, and found this issue opened on Google’s issue tracker: Unable to detect GPU via Tensorflow/Pytorch after restart DLVM.

Other users report a similar problem, but with a different CUDA warning:

[tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

In both cases, for both AI Notebooks and VMs, the problem can be solved by running the following commands:

gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
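
After the Jupyter restart, a quick sanity check (a sketch of what I would expect, assuming the driver reinstall went through) is to confirm that PyTorch can see the GPU again:

import torch

# Should now return True instead of emitting the CUDA warning
print(torch.cuda.is_available())

# Should report the Tesla T4
print(torch.cuda.get_device_name(0))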

It’s not entirely clear to me whether the issue was triggered by the NVIDIA driver updates I applied after booting, or by automatic updates on GCP instances, nor why I experienced it some two months after many other users did (especially given that I update the machine via aptitude every month). Looking at the patching script, though, the cause is clearly the NVIDIA drivers:

DEEPLEARNING_PATH="/opt/deeplearning"
GOOGLE_GCS_PATH="gs://dl-platform-public-nvidia/b191551132"

# Remove the currently installed NVIDIA driver
${DEEPLEARNING_PATH}/uninstall-driver.sh

# Fetch the pinned driver-version file from Google's public bucket
gsutil cp "${GOOGLE_GCS_PATH}/driver-version.sh" "${DEEPLEARNING_PATH}/driver-version.sh"
chmod +x ${DEEPLEARNING_PATH}/driver-version.sh

# Reinstall the driver
${DEEPLEARNING_PATH}/install-driver.sh

This issue should not affect new AI Notebooks or VMs.