TGI - RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x3072 and 1x14155776)

#3
by turjo4nis - opened

The config file uses all default settings.

Command used to launch TGI:
docker run --gpus all --shm-size 1g -p 8080:80 -v $model:$model ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --hostname 0.0.0.0 --num-shard 1 --trust-remote-code --quantize bitsandbytes-nf4
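For reference, once warmup succeeds the server would be queried over TGI's documented /generate route on the mapped port; a minimal sketch in Python, with a placeholder prompt:

import requests

# TGI listens on port 80 in the container, mapped to 8080 on the host above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "What is TGI?", "parameters": {"max_new_tokens": 32}},
)
print(resp.json())  # {"generated_text": "..."} on success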

Full Log output:

2024-05-30T18:00:15.018611Z  INFO text_generation_launcher: Args { model_id: "/media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-05-30T18:00:15.018654Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-05-30T18:00:15.018658Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-05-30T18:00:15.018659Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-05-30T18:00:15.018660Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-05-30T18:00:15.018662Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/` do not contain malicious code.
2024-05-30T18:00:15.018703Z  INFO download: text_generation_launcher: Starting download process.
2024-05-30T18:00:17.328268Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-05-30T18:00:17.621313Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-30T18:00:17.621436Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-30T18:00:20.720863Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2024-05-30T18:00:20.724419Z  INFO shard-manager: text_generation_launcher: Shard ready in 3.102418385s rank=0
2024-05-30T18:00:20.823765Z  INFO text_generation_launcher: Starting Webserver
2024-05-30T18:00:20.865539Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'    
2024-05-30T18:00:20.865552Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'    
2024-05-30T18:00:20.865554Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'    
2024-05-30T18:00:20.865556Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'    
2024-05-30T18:00:20.865557Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'    
2024-05-30T18:00:20.865558Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'    
2024-05-30T18:00:20.865559Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'    
2024-05-30T18:00:20.865561Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'    
2024-05-30T18:00:20.865562Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'    
2024-05-30T18:00:20.865563Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'    
2024-05-30T18:00:20.865564Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'    
2024-05-30T18:00:20.865838Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Mistral)
2024-05-30T18:00:20.865844Z  INFO text_generation_router: router/src/main.rs:260: Using local tokenizer config
2024-05-30T18:00:20.865860Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model /media/turjo/hdd/CSE498r/Phi-3-mini-4k-instruct-bnb-4bit/
2024-05-30T18:00:20.867713Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-05-30T18:00:21.047898Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 98, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 768, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 514, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 461, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 393, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 321, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 194, in forward
    qkv = self.query_key_value(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 419, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 312, in forward
    out = bnb.matmul_4bit(
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 509, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x3072 and 1x14155776)

2024-05-30T18:00:21.116801Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
Error: Warmup(Generation("CANCELLED"))
2024-05-30T18:00:21.136724Z ERROR text_generation_launcher: Webserver Crashed
2024-05-30T18:00:21.136732Z  INFO text_generation_launcher: Shutting down shards
Error: WebserverFailed
2024-05-30T18:00:21.385222Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
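
For what it's worth, the failing shapes line up exactly with a packed 4-bit buffer: mat1 is max_batch_prefill_tokens x hidden_size (4145 x 3072), and 14155776 is the byte count of a packed NF4 fused-QKV weight for Phi-3-mini. A quick arithmetic check in Python (assuming Phi-3-mini's hidden_size of 3072 and a fused QKV projection of 3 x 3072 rows, per the query_key_value frame in the traceback):

hidden_size = 3072                  # Phi-3-mini hidden size; also mat1's width
qkv_rows = 3 * hidden_size          # fused Q/K/V projection: 9216 output rows
elements = qkv_rows * hidden_size   # 28,311,552 weight elements
print(elements // 2)                # two 4-bit values per byte -> 14155776,
                                    # exactly mat2's second dimension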
Unsloth AI org

Can you try upgrading transformers and see if it still works?

pip install "git+https://github.com/huggingface/transformers.git"

Didn't work unfortunately. Still getting the same error.

@turjo4nis did you try with the upgrade flag?

pip install --upgrade "git+https://github.com/huggingface/transformers.git"
Unsloth AI org

Oh wait, I don't think TGI supports loading pre-quantized bnb 4-bit checkpoints like this one.
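
If so, that would explain the shapes: --quantize bitsandbytes-nf4 quantizes whatever weight tensor it loads, so on an already-packed bnb-4bit checkpoint the packed uint8 buffer itself ends up as the layer weight, and after the transpose inside bitsandbytes it becomes the 1 x 14155776 mat2 in the error. A sketch of that shape behaviour (assumes a CUDA device and the bitsandbytes package; sizes per Phi-3-mini's fused QKV):

import torch
import bitsandbytes.functional as F

# Quantizing a QKV-sized fp16 matrix packs two 4-bit values per byte:
w = torch.randn(9216, 3072, dtype=torch.float16, device="cuda")
packed, quant_state = F.quantize_4bit(w, quant_type="nf4")
print(packed.shape)   # torch.Size([14155776, 1])

# Only the matching quant_state can restore the original shape:
print(F.dequantize_4bit(packed, quant_state).shape)   # torch.Size([9216, 3072])
# A loader that sees just the packed buffer has nothing to unpack it with;
# transposed, it is the 1 x 14155776 mat2 from the RuntimeError.

If that is the cause, the likely workaround is to point --model-id at the original fp16 checkpoint (e.g. microsoft/Phi-3-mini-4k-instruct) and keep --quantize bitsandbytes-nf4, letting TGI do the packing at load time.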
