Running into issues when trying to run with TGI

#6
by viraniaman - opened

Llama variants seem to run into these issues frequently. Relevant discussions:
Issue 2 in https://github.com/huggingface/text-generation-inference/issues/769
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

My input and output:

docker run --gpus all -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=32  -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --revision $revision --quantize gptq

2023-09-08T02:41:23.437900Z  INFO text_generation_launcher: Args { model_id: "TheBloke/CodeLlama-34B-Instruct-GPTQ", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "7c73057b8d56", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-09-08T02:41:23.438009Z  INFO download: text_generation_launcher: Starting download process.
2023-09-08T02:41:25.850890Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-09-08T02:41:26.240869Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-08T02:41:26.241382Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-09-08T02:41:30.648305Z  INFO text_generation_launcher: Using exllama kernels

2023-09-08T02:41:30.654582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 187, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 68, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 461, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 399, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 400, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 336, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 218, in __init__
    self.o_proj = TensorParallelRowLinear.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 361, in load
    get_linear(weight, bias, config.quantize),
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 233, in get_linear
    linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py", line 114, in __init__
    assert groupsize == self.groupsize
AssertionError

2023-09-08T02:41:31.147356Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 187, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 68, in __init__
    model = FlashLlamaForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 461, in __init__
    self.model = FlashLlamaModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 399, in __init__
    [

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 400, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 336, in __init__
    self.self_attn = FlashLlamaAttention(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 218, in __init__
    self.o_proj = TensorParallelRowLinear.load(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 361, in load
    get_linear(weight, bias, config.quantize),

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 233, in get_linear
    linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py", line 114, in __init__
    assert groupsize == self.groupsize

AssertionError
 rank=0
2023-09-08T02:41:31.245302Z ERROR text_generation_launcher: Shard 0 failed to start
2023-09-08T02:41:31.245320Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

It looks like a change in this file would fix the issue, according to this comment, but I am not sure exactly what would need to be done. If someone can point me in the right direction, I can raise a PR.
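In case it helps with narrowing this down: the failing line (assert groupsize == self.groupsize in exllama.py) compares the group size handed to the exllama kernel with the one it expects, so it seems worth confirming what this revision was actually quantized with. A minimal sketch, assuming huggingface_hub is installed (the repo and revision are the ones from my command above), that prints the quantization settings stored in the branch:

import json
from huggingface_hub import hf_hub_download

# Hypothetical check, separate from TGI itself: fetch the quantization metadata
# for the exact branch being served, so it can be compared against the
# GPTQ_BITS / GPTQ_GROUPSIZE values passed to the container.
path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    filename="quantize_config.json",
    revision="gptq-4bit-32g-actorder_True",
)
with open(path) as f:
    print(json.load(f))  # e.g. {"bits": 4, "group_size": 32, "desc_act": true, ...}

If the group size stored there does not match what the launcher ends up passing to the kernel, that mismatch would explain the assertion failure.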

I am a noob at loading and running LLMs, but I am an SWE with 5 years of experience, FWIW. Please let me know what I should do next. Thanks!

Also, the number of shards seems to make a difference in some cases. Is that expected to be the case here?


After a bunch of trial and error, this command worked:

docker run --gpus all -p 8080:80 -e DISABLE_EXLLAMA=True -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --revision $revision --quantize gptq --max-batch-prefill-tokens=1024

But the inference speed is quite slow, presumably because disabling the exllama kernels falls back to the slower GPTQ CUDA path.
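For anyone else following along, here is the minimal request I would use to sanity-check that the server is responding, assuming the 8080:80 port mapping from the command above and that the requests package is installed:

import requests

# Smoke test against the running TGI container mapped to localhost:8080.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "def fibonacci(n):",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])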
