Hey IDC users!
Yesterday I updated one of the AI Notebooks I’m using - and to my great surprise, when I tried to run an inference pipeline that usually runs on the GPU, it fell back to the CPU and produced this error message:
/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
(Triggered internally at /opt/conda/conda-bld/pytorch_1614378098133/work/c10/cuda/CUDAFunctions.cpp:109.)
The same error could be replicated by running the following in a Python terminal:
>>> import torch
>>> torch.cuda.is_available()
This looked quite strange from the start, since the GPU was correctly detected by nvidia-smi.
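To make the mismatch explicit, a minimal check in a Python terminal might look like this (the values in the comments are what I’d expect on the broken VM, not a verbatim transcript from it):
>>> import torch
>>> torch.version.cuda          # CUDA version the PyTorch build targets
>>> torch.cuda.is_available()   # False on the affected machine
>>> torch.cuda.device_count()   # 0, even though nvidia-smi lists the T4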
For the sake of completeness, here are the specs of the VM:
- Environment: PyTorch 1.7
- Machine type: 8 vCPUs, 52 GB RAM
- GPUs: NVIDIA Tesla T4 x 1
Restarting the instance didn’t fix the problem, so I started digging a bit online, and found this issue opened on Google’s issue tracker: Unable to detect GPU via Tensorflow/Pytorch after restart DLVM.
Some other users report a similar problem, but with a different CUDA warning:
[tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
In both cases, for both AI Notebooks and VMs, the problem can be solved by running the following commands:
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
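Once the patch has run and Jupyter has restarted, a quick sanity check in a Python terminal should confirm that PyTorch sees the GPU again (the outputs below are what I’d expect on this T4 VM after a successful fix, not a verbatim transcript):
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'Tesla T4'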
It’s not entirely clear to me whether the issue was triggered by the NVIDIA driver updates I did after booting, or by automatic updates on GCP instances - nor why I experienced it some two months after many users did (especially given that I update the machine via aptitude every month). Looking at the patching script, though, it seems clear that the cause is indeed the NVIDIA drivers:
DEEPLEARNING_PATH="/opt/deeplearning"
GOOGLE_GCS_PATH="gs://dl-platform-public-nvidia/b191551132"
# Remove the currently installed NVIDIA driver
${DEEPLEARNING_PATH}/uninstall-driver.sh
# Replace the driver-version file with the one pinned in Google's patch bucket
gsutil cp "${GOOGLE_GCS_PATH}/driver-version.sh" "${DEEPLEARNING_PATH}/driver-version.sh"
chmod +x ${DEEPLEARNING_PATH}/driver-version.sh
# Reinstall the NVIDIA driver
${DEEPLEARNING_PATH}/install-driver.sh
This issue should not affect new AI Notebooks or VMs.