Unable to re-run the notebook

Hi, I was trying to re-run my notebook, but I’m getting the following error message when I try to run a segmentation model using Docker:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "/usr/local/bin/hd-bet", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/HD-BET/HD_BET/hd-bet", line 119, in <module>
    run_hd_bet(input_files, output_files, mode, config_file, device, pp, tta, save_mask, overwrite_existing)
  File "/HD-BET/HD_BET/run.py", line 63, in run_hd_bet
    net.cuda(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Using contrast T1 as reference
Traceback (most recent call last):
  File "scripts/run.py", line 505, in <module>
    not args.no_permissions
  File "scripts/run.py", line 280, in run
    output1 = subp.check_output(["hd-bet", "-i", file_, "-device", "0"])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['hd-bet', '-i', '/output/T1_r2s.nii.gz', '-device', '0']' returned non-zero exit status 1.

I’ve tried looking for solutions online, but nothing has worked. I didn’t make any changes to the environment or to my virtual machine, and the same code works just fine on another virtual machine in the same Google Cloud project.

Any clue as to what the issue could be?

Thanks!

From the log, it looks like the problem is with the GPU: either outdated drivers or a general configuration error. Is your VM a managed notebook instance?
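In the meantime, a quick sanity check on the VM itself would show whether the driver is even loaded. This is just a sketch and assumes nvidia-smi is installed on the host (it normally is on GPU Deep Learning VM images):

import subprocess

# Query the driver directly from the VM (outside Docker). If this prints
# something like "couldn't communicate with the NVIDIA driver", the problem
# is on the host, not inside the container.
result = subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, universal_newlines=True)
print(result.stdout)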


@giemmecci, can you provide some more details about what is going on?


Yes, it’s definitely something with the GPU. I’ve tried looking for the error online, and people seem to fix it either by re-installing PyTorch (which is not the solution for me, since I haven’t installed it on my VM; the error arises when applying a segmentation model using Docker) or by reloading the NVIDIA kernel module.
I’m not sure if this answers your question, but I’m launching the notebook from the AI Platform section (project: idc-external-005, instance: gm-tf2).
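For reference, the kind of isolated check that should show whether a container on this VM can reach the GPU at all looks roughly like this (a sketch only: the image name is a placeholder for the actual segmentation image, and --gpus all assumes a Docker version recent enough to support it):

import subprocess

# Placeholder image name; substitute the segmentation image the notebook uses.
IMAGE = "segmentation-image:latest"

# Reproduce the same lazy CUDA initialization that HD-BET triggers with
# net.cuda(device), but in isolation, so the failure is easier to read.
cmd = [
    "docker", "run", "--rm", "--gpus", "all", IMAGE,
    "python3", "-c",
    "import torch; torch.cuda.init(); print(torch.cuda.get_device_name(0))",
]
result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        universal_newlines=True)
print(result.stdout)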

Thank you for your help and your time!

Since those are managed instances, you can just create a new one and check whether the code works there. If it does, my guess would be that the error you see on the older instance is related to the recent merge of AI Notebooks into the newer Vertex AI product in Google Cloud.


Thanks for the suggestion, but unfortunately it didn’t work. I created another instance with the same configuration as my older one, but I’m still getting the same error message:

09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.103174687731052142735983046836
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.287753826073472752590065451465
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.261091685370401963952512175444
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.116289242878713280880004720679
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "/usr/local/bin/hd-bet", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/HD-BET/HD_BET/hd-bet", line 119, in <module>
    run_hd_bet(input_files, output_files, mode, config_file, device, pp, tta, save_mask, overwrite_existing)
  File "/HD-BET/HD_BET/run.py", line 63, in run_hd_bet
    net.cuda(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Using contrast T1 as reference
Traceback (most recent call last):
  File "scripts/run.py", line 505, in <module>
    not args.no_permissions
  File "scripts/run.py", line 280, in run
    output1 = subp.check_output(["hd-bet", "-i", file_, "-device", "0"])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['hd-bet', '-i', '/output/T1_r2s.nii.gz', '-device', '0']' returned non-zero exit status 1.

I’ve updated the notebook to make the test shorter (I reduced the number of MRIs that are downloaded).


Hey @giemmecci,

I just documented what might be a solution to your problem in another thread. Your case looks different (the error CUDA raises in your case is a RuntimeError rather than an initialization error), but it’s worth a shot, especially because re-installing the NVIDIA drivers usually solves most CUDA issues anyway.

If you still have that instance paused, could you try the fix and let us know whether it works? If that doesn’t solve it, feel free to come back with some logs; I’d be happy to try to troubleshoot this (and I’m curious to see whether we can replicate the issue).
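For the logs, the versions that usually matter for error 999 are the host driver version and the CUDA version PyTorch was built against. A minimal sketch to collect them (assuming nvidia-smi is available on the VM, and that torch may or may not be installed outside the container):

import subprocess
import sys

def capture(cmd):
    # Combine stdout and stderr so failures show up in the pasted log too.
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True).stdout

print("python:", sys.version.split()[0])
print(capture(["nvidia-smi"]))

try:
    import torch
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    print("torch.cuda.is_available():", torch.cuda.is_available())
except ImportError:
    print("torch is only installed inside the container, not on the VM itself")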

Thanks,
Dennis.


Hi! Thanks so much for the hint! I’ll give it a try!
