Does not work with NeMo container
I am using the latest official NeMo Docker container (tag 23.02) and installed Apex and NeMo as recommended, but I still get the error below. What am I doing wrong? I'm using an A5000 GPU.
python megatron_gpt_eval.py gpt_model_file=/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo server=True tensor_model_parallel_size=1 trainer.devices=1
[NeMo W 2023-05-02 09:58:51 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:51 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:52 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:53 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2023-05-02 09:59:11 megatron_init:225] Rank 0 has data parallel group: [0, 1]
[NeMo I 2023-05-02 09:59:11 megatron_init:228] All data parallel group ranks: [[0, 1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:229] Ranks 0 has data parallel rank: 0
[NeMo I 2023-05-02 09:59:11 megatron_init:237] Rank 0 has model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:238] All model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:248] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:252] All tensor model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:253] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-05-02 09:59:11 megatron_init:267] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:279] Rank 0 has embedding group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:285] All pipeline model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:286] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-05-02 09:59:11 megatron_init:287] All embedding group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:288] Rank 0 has embedding rank: 0
23-05-02 09:59:11 - PID:673 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo I 2023-05-02 09:59:11 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpm0_nvln6/2053796188904e679f7e2754a2a1f280_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2023-05-02 09:59:11 megatron_base_model:205] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo I 2023-05-02 09:59:12 megatron_init:225] Rank 0 has data parallel group: [0, 1]
[NeMo I 2023-05-02 09:59:12 megatron_init:228] All data parallel group ranks: [[0, 1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:229] Ranks 0 has data parallel rank: 0
[NeMo I 2023-05-02 09:59:12 megatron_init:237] Rank 0 has model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:238] All model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:248] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:252] All tensor model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:253] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-05-02 09:59:12 megatron_init:267] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:279] Rank 0 has embedding group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:285] All pipeline model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:286] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-05-02 09:59:12 megatron_init:287] All embedding group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:288] Rank 0 has embedding rank: 0
[NeMo I 2023-05-02 09:59:12 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpm0_nvln6/2053796188904e679f7e2754a2a1f280_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2023-05-02 09:59:12 megatron_base_model:205] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo E 2023-05-02 09:59:12 common:506] Model instantiation failed!
Target class: nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
Error(s): precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 485, in from_config_dict
instance = imported_cls(cfg=config, trainer=trainer)
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 128, in __init__
self.model = Float16Module(module=self.model, precision=cfg.precision)
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/modules/common/megatron/module.py", line 278, in __init__
raise Exception(
Exception: precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.
Error executing job with overrides: ['gpt_model_file=/jaiyam/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo', 'server=True', 'tensor_model_parallel_size=2', 'trainer.devices=2']
Traceback (most recent call last):
File "megatron_gpt_eval.py", line 279, in <module>
main() # noqa pylint: disable=no-value-for-parameter
File "/usr/local/lib/python3.8/dist-packages/nemo/core/config/hydra_runner.py", line 105, in wrapper
_run_hydra(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
raise ex
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "megatron_gpt_eval.py", line 182, in main
model = MegatronGPTModel.restore_from(
File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/modelPT.py", line 436, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 366, in restore_from
loaded_params = super().load_config_and_state_dict(
File "/usr/local/lib/python3.8/dist-packages/nemo/core/connectors/save_restore_connector.py", line 162, in load_config_and_state_dict
instance = calling_cls.from_config_dict(config=conf, trainer=trainer)
File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 507, in from_config_dict
raise e
File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 499, in from_config_dict
instance = cls(cfg=config, trainer=trainer)
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 128, in __init__
self.model = Float16Module(module=self.model, precision=cfg.precision)
File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/modules/common/megatron/module.py", line 278, in __init__
raise Exception(
Exception: precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.
I have the same problem.
EDIT: I was able to circumvent this problem by editing line 278 in python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/module.py from
if precision == 16:
to
if precision == "16":
although I am now running into a different problem. I hope this helps you!
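For reference, here is a minimal standalone sketch (not the actual NeMo source, and the function name is made up) of why the int-only comparison trips: Hydra/OmegaConf can hand the precision value over as the string "16" rather than the int 16, so the check falls through to the error branch even though fp16 was requested.

# Hypothetical sketch, NOT the real Float16Module code: normalizing to str
# accepts both 16 (int) and "16" (str) as well as "bf16".
def select_half_precision(precision):
    precision = str(precision)
    if precision == "16":
        return "fp16"
    if precision == "bf16":
        return "bf16"
    raise Exception(
        f"precision {precision} is not supported. "
        "Float16Module (megatron_amp_O2) supports only fp16 and bf16."
    )

print(select_half_precision(16))      # -> fp16
print(select_half_precision("16"))    # -> fp16 (the case that trips an int-only check)
print(select_half_precision("bf16"))  # -> bf16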
Hi, since you are running on an A5000 GPU, can you please try specifying trainer.precision=bf16 when running megatron_gpt_eval.py? We have a fix for trainer.precision=16 in NeMo 1.18: https://github.com/NVIDIA/NeMo/pull/6543.
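For example, adapting the command from the original post (adjust the paths to your own setup):

python megatron_gpt_eval.py gpt_model_file=/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo server=True tensor_model_parallel_size=1 trainer.devices=1 trainer.precision=bf16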
@anilozlu Changing the check from an int to a string is not recommended. The error is correct in pointing out that megatron_amp_O2 will not work with bf16.
Thanks @MaximumEntropy, this solves the problem.