Unable to import pytorch_lightning and torchmetrics in recommended NeMo container
I followed the instructions in your README to perform inference. However, the base image you recommend cannot import pytorch_lightning or torchmetrics without errors.
When I try to launch an inference server with your recommended container nvcr.io/nvidia/nemo:24.01.framework, I get the following error:
0: waiting for server (0.0.0.0:1424) to be up
0: Traceback (most recent call last):
0: File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 23, in <module>
0: from pytorch_lightning.trainer.trainer import Trainer
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/__init__.py", line 26, in <module>
0: from pytorch_lightning.callbacks import Callback # noqa: E402
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
0: from pytorch_lightning.callbacks.batch_size_finder import BatchSizeFinder
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/batch_size_finder.py", line 24, in <module>
0: from pytorch_lightning.callbacks.callback import Callback
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/callback.py", line 22, in <module>
0: from pytorch_lightning.utilities.types import STEP_OUTPUT
0: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/types.py", line 25, in <module>
0: from torchmetrics import Metric
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/__init__.py", line 14, in <module>
0: from torchmetrics import functional # noqa: E402
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/__init__.py", line 14, in <module>
0: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
0: from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate # noqa: F401
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/audio/pit.py", line 21, in <module>
0: from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/__init__.py", line 1, in <module>
0: from torchmetrics.utilities.checks import check_forward_full_state_property # noqa: F401
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/checks.py", line 22, in <module>
0: from torchmetrics.utilities.data import select_topk, to_onehot
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/data.py", line 19, in <module>
0: from torchmetrics.utilities.imports import _TORCH_GREATER_EQUAL_1_6, _TORCH_GREATER_EQUAL_1_7, _TORCH_GREATER_EQUAL_1_8
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 117, in <module>
0: _TORCHVISION_GREATER_EQUAL_0_8: Optional[bool] = _compare_version("torchvision", operator.ge, "0.8.0")
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 79, in _compare_version
0: if not _module_available(package):
0: File "/usr/local/lib/python3.10/dist-packages/torchmetrics/utilities/imports.py", line 60, in _module_available
0: module = import_module(module_names[0])
0: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
0: return _bootstrap._gcd_import(name[level:], package, level)
0: File "/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py", line 6, in <module>
0: from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
0: File "/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py", line 164, in <module>
0: def meta_nms(dets, scores, iou_threshold):
0: File "/leonardo/home/userexternal/frekatha/.local/lib/python3.10/site-packages/torch/library.py", line 467, in inner
0: handle = entry.abstract_impl.register(func_to_register, source)
0: File "/leonardo/home/userexternal/frekatha/.local/lib/python3.10/site-packages/torch/_library/abstract_impl.py", line 30, in register
0: if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
0: RuntimeError: operator torchvision::nms does not exist
0: waiting for server (0.0.0.0:1424) to be up
Torchvision is clearly installed. Output of pip list | grep torch:
open-clip-torch 2.24.0
pytorch-lightning 2.0.7
pytorch-quantization 2.1.2
torch 2.3.0
torch-ema 0.3
torch-tensorrt 2.2.0a0
torchdata 0.7.0a0
torchdiffeq 0.2.3
torchmetrics 0.9.1
torchprofile 0.0.4
torchsde 0.2.6
torchtext 0.17.0a0
torchvision 0.17.0a0
Are you modifying the image nvcr.io/nvidia/nemo:24.01.framework and reinstalling libraries before running inference? The utility functions that compare module versions in pytorch_lightning appear to handle alpha versions of libraries poorly.
It turned out that I had a different PyTorch installed in .local/lib/python3.10/site-packages without being aware of it. Since my home directory was mounted in the container, Python prioritized this copy over the PyTorch version installed in the container. This caused a mismatch between torch and torchvision, leading to this error.
My bad!
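For anyone hitting the same traceback, here is a minimal sketch of how one could check whether a package resolves to the user site-packages directory instead of the container's copy. The helper names is_under and loaded_from_user_site are made up for illustration and are not part of NeMo or PyTorch. It relies only on the standard library and does not import torch itself, so it still works when the import is what crashes.

```python
import importlib.util
import site
from pathlib import Path


def is_under(path, root) -> bool:
    """Return True if `path` is `root` itself or lies somewhere below it."""
    path, root = Path(path), Path(root)
    return path == root or root in path.parents


def loaded_from_user_site(module_name: str) -> bool:
    """Return True if `module_name` would be imported from the user site-packages.

    Uses find_spec, so the module itself is not imported.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.has_location:
        return False  # not installed, or a built-in/namespace module
    user_site = Path(site.getusersitepackages()).resolve()
    return is_under(Path(spec.origin).resolve(), user_site)


if __name__ == "__main__":
    print("user site-packages:", site.getusersitepackages())
    print("user site enabled: ", site.ENABLE_USER_SITE)
    for name in ("torch", "torchvision"):
        spec = importlib.util.find_spec(name)
        origin = spec.origin if spec else "not installed"
        flag = " (from user site!)" if loaded_from_user_site(name) else ""
        print(f"{name}: {origin}{flag}")
```

If torch is reported as coming from the user site while torchvision comes from the container's dist-packages, that matches the mismatch described above. One possible mitigation, assuming the home directory has to stay mounted, is to start Python with the -s flag or set PYTHONNOUSERSITE=1 in the container environment so that the user site-packages directory is not added to sys.path.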