Unable to re-run the notebook

giemmecci · September 2, 2021, 10:14pm

Hi, I was trying to re-run my notebook, but I’m getting the following error message when I try to run a segmentation model using Docker:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "/usr/local/bin/hd-bet", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/HD-BET/HD_BET/hd-bet", line 119, in <module>
    run_hd_bet(input_files, output_files, mode, config_file, device, pp, tta, save_mask, overwrite_existing)
  File "/HD-BET/HD_BET/run.py", line 63, in run_hd_bet
    net.cuda(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Using contrast T1 as reference
Traceback (most recent call last):
  File "scripts/run.py", line 505, in <module>
    not args.no_permissions
  File "scripts/run.py", line 280, in run
    output1 = subp.check_output(["hd-bet", "-i", file_, "-device", "0"])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['hd-bet', '-i', '/output/T1_r2s.nii.gz', '-device', '0']' returned non-zero exit status 1.

I’ve tried looking for solutions online, but nothing worked. I didn’t make any changes to the environment or my virtual machine, and the same code works fine on another virtual machine in the same Google Cloud project.

Any idea what the issue could be?

Thanks!

aptekarev · September 3, 2021, 2:45pm

Judging from the log, the problem appears to be with the GPU: either outdated drivers or a general configuration error. Is your VM a managed notebook instance?
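A quick way to tell a driver problem from a configuration problem is to check whether the NVIDIA driver responds on the VM at all. The snippet below is only a sketch: the helper name `check_gpu_visibility` is made up for illustration, and it assumes `nvidia-smi` is the tool installed alongside the driver on the instance.

```python
import shutil
import subprocess


def check_gpu_visibility(cmd="nvidia-smi"):
    """Return (ok, details) describing whether the NVIDIA driver responds.

    Hypothetical helper; `cmd` defaults to nvidia-smi, which is assumed
    to be on PATH when the driver is installed.
    """
    if shutil.which(cmd) is None:
        return False, "{} not found on PATH".format(cmd)
    try:
        # stdout/stderr pipes are used instead of capture_output so the
        # sketch also runs on the Python 3.6 seen in the traceback.
        result = subprocess.run(
            [cmd],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return result.returncode == 0, result.stdout.strip()


if __name__ == "__main__":
    ok, details = check_gpu_visibility()
    print("driver responds" if ok else "driver problem", "\n", details)
```

If the check fails on the host, the problem is in the VM's driver setup rather than in the container or the notebook code.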

fedorov · September 3, 2021, 5:00pm

@giemmecci can you provide the details about what is going on:

are you using AI Notebooks VM and not Colab?
are you saying the exact same VM/notebook that worked in the past does not work, and you didn’t apply any system updates to the VM?
can you confirm which notebook you are having trouble with? I tried to run this notebook https://github.com/giemmecci/IDC/blob/main/GBM_IDH1_mutation_radiomic_classifier-Copy-GCP.ipynb, but I am not sure it is ready to use. I see that you check out https://github.com/dicomsort/dicomsort, which is a completely different tool from https://github.com/pieper/dicomsort, so the step where you try to run dicomsort/dicomsort.py is not expected to work. I am not sure if I am looking at the right notebook.

giemmecci · September 3, 2021, 10:19pm

Yes, it’s definitely something with the GPU. I’ve looked up the error online, and people seem to fix it either by re-installing PyTorch or by reloading the NVIDIA kernel module. Re-installing PyTorch is not a solution for me, since I haven’t installed it on my VM; the error arises when I apply a segmentation model using Docker.
I’m not sure if this will answer your question, but I’m launching the notebook from the AI Platform section (project: idc-external-005, instance gm-tf2).
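For reference, the "reload the NVIDIA kernel module" fix mentioned in such reports usually means unloading and reloading one driver module on the host. The sketch below assumes the module in question is `nvidia_uvm` (the one most often named for CUDA error 999); that is an assumption, not something confirmed for this VM, and the helper name `reload_nvidia_uvm` is hypothetical. It defaults to a dry run, so nothing is changed unless explicitly requested.

```python
import subprocess

# Assumption: nvidia_uvm is the module to reload; other setups may differ.
RELOAD_COMMANDS = [
    ["sudo", "rmmod", "nvidia_uvm"],
    ["sudo", "modprobe", "nvidia_uvm"],
]


def reload_nvidia_uvm(dry_run=True):
    """Print the reload commands, or run them when dry_run is False."""
    for cmd in RELOAD_COMMANDS:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check_call raises CalledProcessError if a command fails,
            # for example when the module is still in use by a process.
            subprocess.check_call(cmd)
    return RELOAD_COMMANDS


if __name__ == "__main__":
    reload_nvidia_uvm(dry_run=True)
```

The commands have to run on the VM itself, not inside the Docker container, and any process holding the GPU has to be stopped first.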


are you using AI Notebooks VM and not Colab?
Yes, AI Notebooks VM, no Colab.

are you saying the exact same VM/notebook that worked in the past does not work, and you didn’t apply any system updates to the VM?
Yes; I didn’t update the VM, and I was able to run the Docker container with the segmentation model in the past. I’ve tried running the notebook on another VM in the same project (kevin-ml) and it worked.

can you confirm which notebook you are having troubles with? I tried to run this notebook https://github.com/giemmecci/IDC/blob/main/GBM_IDH1_mutation_radiomic_classifier-Copy-GCP.ipynb, but I am not sure it is ready to use. I see that you check out https://github.com/dicomsort/dicomsort, which is a completely different tool from https://github.com/pieper/dicomsort, and so where you try to run dicomsort/dicomsort.py is not expected to work. I am not sure if I am looking at the right notebook.
Oh, good catch! I didn’t notice the dicomsort error because I had already installed the correct version before I started working on this notebook.
I’ve created an updated version of the notebook that should run fine from scratch; you can find it here.

Thank you for your help and your time!

aptekarev · September 5, 2021, 9:53am

Since those are managed instances, you can just create a new one and check whether the notebook works there. If it does, my guess would be that the error you see on the older instance is related to the recent merger of AI Notebooks into the newer Vertex AI product in Google Cloud.

giemmecci · September 5, 2021, 9:37pm

Thanks for the suggestion, but unfortunately it didn’t work. I created another instance with the same characteristics as my older one, but I’m still getting the same error message:

09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.103174687731052142735983046836
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.287753826073472752590065451465
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.261091685370401963952512175444
09:30:23 PM: Reading series 1.3.6.1.4.1.14519.5.2.1.1706.4001.116289242878713280880004720679
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=999 : unknown error
Traceback (most recent call last):
  File "/usr/local/bin/hd-bet", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/HD-BET/HD_BET/hd-bet", line 119, in <module>
    run_hd_bet(input_files, output_files, mode, config_file, device, pp, tta, save_mask, overwrite_existing)
  File "/HD-BET/HD_BET/run.py", line 63, in run_hd_bet
    net.cuda(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Using contrast T1 as reference
Traceback (most recent call last):
  File "scripts/run.py", line 505, in <module>
    not args.no_permissions
  File "scripts/run.py", line 280, in run
    output1 = subp.check_output(["hd-bet", "-i", file_, "-device", "0"])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['hd-bet', '-i', '/output/T1_r2s.nii.gz', '-device', '0']' returned non-zero exit status 1.
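The final error in the log comes from the `subp.check_output(["hd-bet", ..., "-device", "0"])` call in `scripts/run.py`. One way to keep the pipeline running while the GPU issue is investigated would be a CPU fallback around that call. This is a sketch only: the wrapper `run_hd_bet_with_fallback` and its `runner` parameter are hypothetical, and it assumes `hd-bet` accepts `-device cpu` as described in the HD-BET README, which should be verified against the version in the Docker image.

```python
import subprocess


def run_hd_bet_with_fallback(path, device="0", runner=subprocess.check_output):
    """Run hd-bet on `device`; retry once on CPU if the call fails.

    Hypothetical wrapper. `runner` is injectable so the logic can be
    exercised without hd-bet installed.
    """
    cmd = ["hd-bet", "-i", path, "-device", device]
    try:
        # Merge stderr into the captured output so the CUDA error text
        # is available on the exception instead of being lost.
        return runner(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        if device == "cpu":
            raise  # already on CPU, nothing left to fall back to
        output = (exc.output or b"").decode("utf-8", errors="replace")
        print("hd-bet failed on device {}:\n{}".format(device, output))
        print("retrying on CPU (assumes '-device cpu' is supported)")
        return run_hd_bet_with_fallback(path, device="cpu", runner=runner)
```

CPU inference is much slower, so this is meant as a stopgap rather than a fix for the underlying CUDA error.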

giemmecci · September 5, 2021, 9:39pm

I’ve updated the notebook to make the test shorter by reducing the number of MRIs that are downloaded.

denbonte · October 8, 2021, 9:32am

Hey @giemmecci,

I just documented in another thread what might be a solution to your problem. Your case looks different (the error CUDA raises in your case is a RuntimeError, not an initialization error), but it’s worth a shot, especially because re-installing the NVIDIA drivers usually solves most CUDA issues anyway.

If you still have that instance paused, could you try the fix and let us know if that works? If that doesn’t solve it, feel free to come back with some logs. I would be happy to try and troubleshoot this (and curious to see if we can manage to replicate the issue).

Thanks,
Dennis.

giemmecci · October 11, 2021, 8:43pm

Hi! Thanks so much for the hint! I’ll give it a try!
