
Does not work with NeMo container #2

by dataplayer12 - opened

I am using the latest official NeMo Docker container (tag 23.02) and installed Apex and NeMo as recommended, but I still get the error below. What am I doing wrong? I'm using an A5000 GPU.

python megatron_gpt_eval.py gpt_model_file=/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo server=True tensor_model_parallel_size=1 trainer.devices=1
[NeMo W 2023-05-02 09:58:51 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:51 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:52 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-02 09:58:53 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2023-05-02 09:59:11 megatron_init:225] Rank 0 has data parallel group: [0, 1]
[NeMo I 2023-05-02 09:59:11 megatron_init:228] All data parallel group ranks: [[0, 1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:229] Ranks 0 has data parallel rank: 0
[NeMo I 2023-05-02 09:59:11 megatron_init:237] Rank 0 has model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:238] All model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:248] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:252] All tensor model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:253] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-05-02 09:59:11 megatron_init:267] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:279] Rank 0 has embedding group: [0]
[NeMo I 2023-05-02 09:59:11 megatron_init:285] All pipeline model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:286] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-05-02 09:59:11 megatron_init:287] All embedding group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:11 megatron_init:288] Rank 0 has embedding rank: 0
23-05-02 09:59:11 - PID:673 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo I 2023-05-02 09:59:11 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpm0_nvln6/2053796188904e679f7e2754a2a1f280_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2023-05-02 09:59:11 megatron_base_model:205] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo I 2023-05-02 09:59:12 megatron_init:225] Rank 0 has data parallel group: [0, 1]
[NeMo I 2023-05-02 09:59:12 megatron_init:228] All data parallel group ranks: [[0, 1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:229] Ranks 0 has data parallel rank: 0
[NeMo I 2023-05-02 09:59:12 megatron_init:237] Rank 0 has model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:238] All model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:248] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:252] All tensor model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:253] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-05-02 09:59:12 megatron_init:267] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:279] Rank 0 has embedding group: [0]
[NeMo I 2023-05-02 09:59:12 megatron_init:285] All pipeline model parallel group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:286] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-05-02 09:59:12 megatron_init:287] All embedding group ranks: [[0], [1]]
[NeMo I 2023-05-02 09:59:12 megatron_init:288] Rank 0 has embedding rank: 0
[NeMo I 2023-05-02 09:59:12 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpm0_nvln6/2053796188904e679f7e2754a2a1f280_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2023-05-02 09:59:12 megatron_base_model:205] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo E 2023-05-02 09:59:12 common:506] Model instantiation failed!
    Target class:       nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
    Error(s):   precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 485, in from_config_dict
        instance = imported_cls(cfg=config, trainer=trainer)
      File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 128, in __init__
        self.model = Float16Module(module=self.model, precision=cfg.precision)
      File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/modules/common/megatron/module.py", line 278, in __init__
        raise Exception(
    Exception: precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.

Error executing job with overrides: ['gpt_model_file=/jaiyam/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo', 'server=True', 'tensor_model_parallel_size=2', 'trainer.devices=2']
Traceback (most recent call last):
  File "megatron_gpt_eval.py", line 279, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/usr/local/lib/python3.8/dist-packages/nemo/core/config/hydra_runner.py", line 105, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "megatron_gpt_eval.py", line 182, in main
    model = MegatronGPTModel.restore_from(
  File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/modelPT.py", line 436, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 366, in restore_from
    loaded_params = super().load_config_and_state_dict(
  File "/usr/local/lib/python3.8/dist-packages/nemo/core/connectors/save_restore_connector.py", line 162, in load_config_and_state_dict
    instance = calling_cls.from_config_dict(config=conf, trainer=trainer)
  File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 507, in from_config_dict
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nemo/core/classes/common.py", line 499, in from_config_dict
    instance = cls(cfg=config, trainer=trainer)
  File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 128, in __init__
    self.model = Float16Module(module=self.model, precision=cfg.precision)
  File "/usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/modules/common/megatron/module.py", line 278, in __init__
    raise Exception(
Exception: precision 16 is not supported. Float16Module (megatron_amp_O2) supports only fp16 and bf16.

I have the same problem

EDIT: I was able to circumvent this problem by editing line 278 in python3.8/site-packages/nemo/collections/nlp/modules/common/megatron/module.py from

if precision == 16:

to

if precision == "16":

although I am having a different problem now. I hope this helps you!
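For reference, here is a minimal sketch (not the actual NeMo code) of why the one-character patch works, assuming the failure comes from newer PyTorch Lightning versions passing precision as the string "16" where older versions passed the int 16:

def check_precision(precision):
    # Normalize to str so both the old int form (16) and the new string
    # form ("16") are accepted before dispatching to fp16/bf16 handling.
    if str(precision) == "16":
        return "fp16"
    if str(precision) == "bf16":
        return "bf16"
    raise Exception(
        f"precision {precision} is not supported. "
        "Float16Module (megatron_amp_O2) supports only fp16 and bf16."
    )

print(check_precision(16))      # fp16 (old-style int)
print(check_precision("16"))    # fp16 (new-style string)
print(check_precision("bf16"))  # bf16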

Hi, since you are running on an A5000 GPU, can you please try specifying trainer.precision=bf16 when running megatron_gpt_eval.py? We have a fix for trainer.precision=16 in NeMo 1.18: https://github.com/NVIDIA/NeMo/pull/6543.
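For example, the command from the top of this thread with that override added should look something like:

python megatron_gpt_eval.py gpt_model_file=/gpt/GPT-2B-001/GPT-2B-001_bf16_tp1.nemo server=True tensor_model_parallel_size=1 trainer.devices=1 trainer.precision=bf16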

@anilozlu Changing the check from an int to a string is not recommended. The error is correct in pointing out that Float16Module (megatron_amp_O2) supports only fp16 and bf16, not precision 16.

Thanks @MaximumEntropy, this solves the problem.

dataplayer12 changed discussion status to closed
