NeMo

Unable to import pytorch_lightning and torchmetrics in recommended NeMo container

#6
by Lauler - opened

I have followed the instructions in your README to perform inference. However, the base image you recommend cannot import pytorch_lightning or torchmetrics without errors.

When I try to launch an inference server with your recommended container nvcr.io/nvidia/nemo:24.01.framework, I get the following traceback:

0: waiting for server (0.0.0.0:1424) to be up
0: Traceback (most recent call last):
0:   File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 23, in <module>
0:     from pytorch_lightning.trainer.trainer import Trainer
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/__init__.py", line 26, in <module>
0:     from pytorch_lightning.callbacks import Callback  # noqa: E402
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
0:     from pytorch_lightning.callbacks.batch_size_finder import BatchSizeFinder
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/batch_size_finder.py", line 24, in <module>
0:     from pytorch_lightning.callbacks.callback import Callback
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/callback.py", line 22, in <module>
0:     from pytorch_lightning.utilities.types import STEP_OUTPUT
0:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/types.py", line 25, in <module>
0:     from torchmetrics import Metric
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/__init__.py", line 14, in <module>
0:     from torchmetrics import functional  # noqa: E402
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/__init__.py", line 14, in <module>
0:     from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
0:     from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate  # noqa: F401
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
0:     from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
0:     from torchmetrics.utilities.checks import check_forward_full_state_property  # noqa: F401
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/checks.py", line 22, in <module>
0:     from torchmetrics.utilities.data import select_topk, to_onehot
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/data.py", line 19, in <module>
0:     from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_6, _TORCH_GREATER_EQUAL_1_7, _TORCH_GREATER_EQUAL_1_8
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 117, in <module>
0:     _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
0:     if not _module_available(package):
0:   File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
0:     module = import_module(module_names[0])
0:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
0:     return _bootstrap._gcd_import(name[level:], package, level)
0:   File "/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py", line 6, in <module>
0:     from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
0:   File "/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py", line 164, in <module>
0:     def meta_nms(dets, scores, iou_threshold):
0:   File "/leonardo/home/userexternal/frekatha/.local/lib/python3.10/site-packages/torch/library.py", line 467, in inner
0:     handle = entry.abstract_impl.register(func_to_register, source)
0:   File "/leonardo/home/userexternal/frekatha/.local/lib/python3.10/site-packages/torch/_library/abstract_impl.py", line 30, in register
0:     if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
0: RuntimeError: operator torchvision::nms does not exist
0: waiting for server (0.0.0.0:1424) to be up

Torchvision is clearly installed. Output of pip list | grep torch:

open-clip-torch           2.24.0
pytorch-lightning         2.0.7
pytorch-quantization      2.1.2
torch                     2.3.0
torch-ema                 0.3
torch-tensorrt            2.2.0a0
torchdata                 0.7.0a0
torchdiffeq               0.2.3
torchmetrics              0.9.1
torchprofile              0.0.4
torchsde                  0.2.6
torchtext                 0.17.0a0
torchvision               0.17.0a0

Are you modifying the image nvcr.io/nvidia/nemo:24.01.framework and reinstalling libraries before running inference? The utility functions that compare module versions (torchmetrics/utilities/imports.py in the traceback, imported via pytorch_lightning) appear to handle alpha versions of libraries badly.

I somehow had a different PyTorch installed in .local/lib/python3.10/site-packages in my home directory without being aware of it (the last torch frames in the traceback point there). Since my home directory was mounted in the container, Python prioritized this copy over the PyTorch version installed in the container.

This caused a mismatch between torch and torchvision, leading to this error.
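For anyone hitting the same thing, here is a minimal sketch of how one might check where each package would be loaded from before importing it. The helper names `locate` and `from_user_site` are made up for illustration; the sketch only uses the standard library and assumes the stray install lives in the per-user site-packages directory, as it did in my case.

```python
import importlib.util
import site


def locate(package: str):
    """Return the file a top-level package would be imported from, or None.

    find_spec() only locates the package; it does not execute it, so this
    works even when the actual import would crash.
    """
    spec = importlib.util.find_spec(package)
    return spec.origin if spec is not None else None


def from_user_site(path):
    """True if the path lives under the per-user site-packages directory."""
    user_site = site.getusersitepackages()
    return path is not None and path.startswith(user_site)


if __name__ == "__main__":
    print("user site enabled:", site.ENABLE_USER_SITE)
    print("user site dir:", site.getusersitepackages())
    for name in ("torch", "torchvision", "torchmetrics", "pytorch_lightning"):
        path = locate(name)
        flag = "  <-- user site" if from_user_site(path) else ""
        print(f"{name}: {path}{flag}")
```

If torch shows up under the user site while torchvision comes from /usr/local/lib/python3.10/dist-packages, that is the mismatch. Setting the environment variable PYTHONNOUSERSITE=1 or running Python with the -s flag should keep the user site directory off sys.path; not mounting the home directory into the container would also avoid it.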

My bad!

Lauler changed discussion status to closed
