KeyError: 'model.layers.0.mlp.down_proj.g_idx' ..?

#1
by andreass123 - opened

aphrodite cli

aphrodite run ./llama-3-marlin/ --quantization marlin --tensor-parallel-size 2 \
  --gpu-memory-utilization 1.0 --kv-cache-dtype fp8 --max-model-len 8192 \
  --host 0.0.0.0 --port 8888 --served-model-name custom_model

aphrodite startup log

INFO: CUDA_HOME is not found in the environment. Using /usr/local/cuda as CUDA_HOME.
INFO: Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the
performance. But it may cause slight accuracy drop without scaling factors. FP8_E5M2 (without scaling) is
only supported on cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for
common inference criteria.
2024-06-11 12:26:02,678 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = './llama-3-marlin/'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = marlin
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = fp8
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO: Using FlashAttention backend.
(RayWorkerAphrodite pid=78653) WARNING: Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerAphrodite pid=78653) INFO: Using FlashAttention backend.
INFO: Aphrodite is using nccl==2.20.5
(RayWorkerAphrodite pid=78653) INFO: Aphrodite is using nccl==2.20.5
INFO: reading GPU P2P access cache from /home/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=78653) INFO: reading GPU P2P access cache from /home/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=78653) WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
(RayWorkerAphrodite pid=78653) To silence this warning, specify disable_custom_all_reduce=True explicitly.

-- ERROR --

[rank0]: Traceback (most recent call last):
[rank0]: File "/home/miniforge3/envs/aphrodite/bin/aphrodite", line 8, in
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/cli.py", line 25, in main
[rank0]: args.func(args)
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
[rank0]: engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
[rank0]: engine = cls(engine_config.parallel_config.worker_use_ray,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 131, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/executor_base.py", line 39, in init
[rank0]: self._init_executor()
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 45, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 193, in _init_workers_ray
[rank0]: self._run_workers(
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 309, in _run_workers
[rank0]: driver_worker_output = getattr(self.driver_worker,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/task_handler/worker.py", line 125, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/task_handler/model_runner.py", line 179, in load_model
[rank0]: self.model = get_model(
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/modeling/loader.py", line 103, in get_model
[rank0]: model.load_weights(model_config.model, model_config.download_dir,
[rank0]: File "/home/miniforge3/envs/aphrodite/lib/python3.11/site-packages/aphrodite/modeling/models/llama.py", line 497, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.g_idx'

--
I just downloaded the Marlin model and ran it with the Aphrodite engine.
I tried AutoGPTQ too.

allganize org

This model uses gptq_marlin, which is comparable to Marlin. vLLM will automatically convert the gptq_marlin format to marlin on initialization. Please try vLLM for now, as we plan to switch to the actual Marlin format soon. Apologies for any confusion.
ref: https://github.com/vllm-project/vllm/issues/5080

  • config.json:
"quantization_config": {
    "bits": 4,
    "damp_percent": 0.01,
    "desc_act": false,
    "group_size": 128,
    "is_marlin_format": true,
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },

vLLM isn't working either...

vLLM/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 427, in load_weights
[rank0]: param = params_dict[name]
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.g_idx'
^C

allganize org

Hi andreass123, due to difficulties with version dependency management, we just added a GPTQ version of the model. Sorry for the inconvenience!
https://huggingface.co/allganize/Llama-3-Alpha-Ko-8B-Instruct-GPTQ

kuotient changed discussion status to closed

thx!
