Run with vLLM

#4
by kuliev-vitaly - opened

What is the correct way to run this model with vLLM? I get an exception: gptq quantization not supported.

Neural Magic org

What GPU are you using? GPTQ quantization requires compute capability 7.5 at least, I believe.
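
If you want to double-check, here is a minimal sketch (assuming PyTorch is installed in the same environment as vLLM) that prints the compute capability of each visible GPU:

```python
import torch

# Print the CUDA compute capability of each visible GPU.
# The GPTQ/Marlin kernels in vLLM expect at least 7.5 (Turing or newer).
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name} -> compute capability {major}.{minor}")
```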

I have 4x RTX 3090 Ti. Other GPTQ models run successfully, for example 'cortecs/Meta-Llama-3-70B-Instruct-GPTQ-8b'.

sudo docker run --ipc=host --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=0,1,2,3"' -p 9000:8000 --mount type=bind,source=/home/me/.cache,target=/root/.cache vllm/vllm-openai:v0.5.3.post1 --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 --tensor-parallel-size 4 --gpu-memory-utilization 0.92 -q gptq

Here is the exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in
run_server(args)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
if llm_engine is not None else AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
model_config = ModelConfig(
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 179, in init
self._verify_quantization()
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 227, in _verify_quantization
raise ValueError(
ValueError: Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the quantization argument (gptq).

Neural Magic org

Please remove -q gptq; there is no need to specify the quantization. vLLM will automatically detect the best method to run with.
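
The same applies if you use the offline Python API instead of the Docker entrypoint. A minimal sketch for illustration, mirroring the parameters from your Docker command:

```python
from vllm import LLM, SamplingParams

# No quantization= argument: vLLM reads the compressed-tensors config
# from the checkpoint and picks the appropriate kernels itself.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```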

sudo docker run --ipc=host --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=0,1,2,3"' -p 9000:8000 --mount type=bind,source=/home/me/.cache,target=/root/.cache vllm/vllm-openai:v0.5.3.post1 --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 --tensor-parallel-size 4 --gpu-memory-utilization 0.92
[sudo] password for me: INFO 08-01 06:42:33 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-01 06:42:33 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.92, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-01 06:42:34 config.py:246] compressed-tensors quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-01 06:42:34 config.py:715] Defaulting to use mp for distributed inference
WARNING 08-01 06:42:34 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-01 06:42:34 config.py:806] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-01 06:42:34 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-01 06:42:34 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=77) INFO 08-01 06:42:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=78) INFO 08-01 06:42:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=76) INFO 08-01 06:42:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-01 06:42:36 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=76) INFO 08-01 06:42:36 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=76) INFO 08-01 06:42:36 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-01 06:42:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=77) INFO 08-01 06:42:36 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=78) INFO 08-01 06:42:36 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=77) INFO 08-01 06:42:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=78) INFO 08-01 06:42:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=76) WARNING 08-01 06:42:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-01 06:42:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=77) WARNING 08-01 06:42:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=78) WARNING 08-01 06:42:37 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-01 06:42:37 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x73a5eb89dde0>, local_subscribe_port=47399, local_sync_port=35421, remote_subscribe_port=None, remote_sync_port=None)
INFO 08-01 06:42:37 model_runner.py:680] Starting to load model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16...
(VllmWorkerProcess pid=78) INFO 08-01 06:42:37 model_runner.py:680] Starting to load model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16...
(VllmWorkerProcess pid=76) INFO 08-01 06:42:37 model_runner.py:680] Starting to load model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16...
(VllmWorkerProcess pid=77) INFO 08-01 06:42:37 model_runner.py:680] Starting to load model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a16...
INFO 08-01 06:42:38 weight_utils.py:223] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=78) INFO 08-01 06:42:38 weight_utils.py:223] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=76) INFO 08-01 06:42:38 weight_utils.py:223] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=77) INFO 08-01 06:42:38 weight_utils.py:223] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/15 [00:01<00:22, 1.63s/it]
Loading safetensors checkpoint shards: 13% Completed | 2/15 [00:03<00:24, 1.88s/it]
Loading safetensors checkpoint shards: 20% Completed | 3/15 [00:05<00:23, 1.97s/it]
Loading safetensors checkpoint shards: 27% Completed | 4/15 [00:07<00:22, 2.06s/it]
Loading safetensors checkpoint shards: 33% Completed | 5/15 [00:10<00:21, 2.16s/it]
Loading safetensors checkpoint shards: 40% Completed | 6/15 [00:12<00:19, 2.17s/it]
Loading safetensors checkpoint shards: 47% Completed | 7/15 [00:14<00:17, 2.22s/it]
Loading safetensors checkpoint shards: 53% Completed | 8/15 [00:17<00:15, 2.25s/it]
Loading safetensors checkpoint shards: 60% Completed | 9/15 [00:18<00:12, 2.07s/it]
Loading safetensors checkpoint shards: 67% Completed | 10/15 [00:21<00:10, 2.14s/it]
Loading safetensors checkpoint shards: 73% Completed | 11/15 [00:23<00:08, 2.20s/it]
Loading safetensors checkpoint shards: 80% Completed | 12/15 [00:25<00:06, 2.27s/it]
Loading safetensors checkpoint shards: 87% Completed | 13/15 [00:28<00:04, 2.25s/it]
Loading safetensors checkpoint shards: 93% Completed | 14/15 [00:30<00:02, 2.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:33<00:00, 2.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:33<00:00, 2.20s/it]
INFO 08-01 06:43:12 model_runner.py:692] Loading model weights took 16.9568 GB
(VllmWorkerProcess pid=77) INFO 08-01 06:43:12 model_runner.py:692] Loading model weights took 16.9568 GB
(VllmWorkerProcess pid=76) INFO 08-01 06:43:12 model_runner.py:692] Loading model weights took 16.9568 GB
(VllmWorkerProcess pid=78) INFO 08-01 06:43:12 model_runner.py:692] Loading model weights took 16.9568 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]: run_server(args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 178, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
[rank0]: model_output = self.model(input_ids, positions, kv_caches,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 255, in forward
[rank0]: hidden_states = self.mlp(hidden_states)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 89, in forward
[rank0]: x, _ = self.down_proj(x)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 783, in forward
[rank0]: output_parallel = self.quant_method.apply(self,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 320, in apply
[rank0]: return scheme.apply_weights(layer, x, bias=bias)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 161, in apply_weights
[rank0]: return apply_gptq_marlin_linear(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 251, in apply_gptq_marlin_linear
[rank0]: output = ops.gptq_marlin_gemm(reshaped_x,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 291, in gptq_marlin_gemm
[rank0]: return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 854, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] , Traceback (most recent call last):
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA error: an illegal memory access was encountered
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] , Traceback (most recent call last):
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] , Traceback (most recent call last):
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=77) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=78) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=76) ERROR 08-01 06:43:14 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context

A lot of the same errors repeat here.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x73a64f644897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73a64f5f4b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x73a64f71c718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x73a6509198e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x73a65091d9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x73a65092305c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x73a650923dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6df4 (0x73a69c3dadf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x8609 (0x73a69d49c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x73a69d5d6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x73a64f644897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x73a6505a7119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x73a69c3dadf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x73a69d49c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x73a69d5d6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

I shortened the output to fit within the message limit

Neural Magic org

This is related to a known issue with that release that we have fixed on main. Please install a recent commit from the nightly build. I tested that this one works with other 3.1 GPTQ models:

pip uninstall vllm
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/c8a7e93273ff4338d6f89f8a63ff16426ac240b8/vllm-0.5.3.post1-cp38-abi3-manylinux1_x86_64.whl
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 -tp 2
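
Once the server is up, you can sanity-check it through the OpenAI-compatible endpoint. A quick sketch, assuming the default port 8000 and the openai Python client installed:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key is not checked by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```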

Great! I use Docker to start vLLM, so I will test it with the next release on Docker Hub.

Confirming the fix: with the vllm/vllm-openai v0.5.4 Docker image, the model works.

Neural Magic org

Thank you!

mgoin changed discussion status to closed
