text-generation-inference GitHub issue?
Is there an open issue on https://github.com/huggingface/text-generation-inference tracking the sharded 16 bit floating point problem?
No, not yet. Most likely it is related to the vocabulary size of 32007. Unfortunately I noticed it too late; it should at least have been rounded to a number divisible by 16.
I heard from others that TGI failed to start and gave them the error "The choosen size 32007 is not compatible with sharding on 4 shards", caused by an assert added by this commit: https://github.com/huggingface/text-generation-inference/commit/67347950b7518efeb64c7f99ee360af685b53934
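Roughly, the check amounts to something like this (an illustrative sketch, not the exact TGI code): tensor-parallel sharding splits the embedding and lm_head weights along the vocabulary dimension, so the vocabulary size must divide evenly by the shard count.

```python
# Illustrative sketch of the divisibility constraint behind the error;
# the function name and structure are my own, not TGI's actual code.
def check_shardable(vocab_size: int, world_size: int) -> None:
    # Each shard gets vocab_size / world_size rows of the embedding matrix,
    # so an uneven split cannot work.
    assert vocab_size % world_size == 0, (
        f"The choosen size {vocab_size} is not compatible "
        f"with sharding on {world_size} shards"
    )

check_shardable(32007, 4)  # raises AssertionError: 32007 % 4 != 0
```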
Ah, thanks for the update. I didn't get the assertion error you mentioned; the model ran for me when using the Docker image ghcr.io/huggingface/text-generation-inference:1.0.1. But its responses to questions like "Can you write a python script to calculate Fibonacci sequence?" were correct only in Python syntax and English grammar, not in the algorithm or the conversation flow (e.g. it would answer its own questions). For reference, the Pythia 12B OASST model gets the Python Fibonacci sequence correct.
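For reference, a correct answer to that test prompt would be something along these lines (my own minimal example, not model output):

```python
# A minimal correct Fibonacci script, shown only as a reference point
# for what a passing answer to the test prompt looks like.
def fibonacci(n: int) -> list[int]:
    """Return the first n Fibonacci numbers."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```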
You could try nf4 quantization with --sharded false.
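For anyone loading the model directly instead of through TGI: nf4 is the 4-bit NormalFloat type from bitsandbytes, and a rough single-GPU equivalent in transformers looks like this (a sketch assuming a recent transformers with bitsandbytes installed; the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NormalFloat (nf4) quantization via bitsandbytes; running unsharded
# on a single device sidesteps the vocab-size divisibility constraint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```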
...
@mgunther I uploaded a new version with resized embedding & lm_head layers, which now have a size divisible by 128. Could you please try again and check whether it now works for you?
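For reference, such a resize can be done in transformers roughly like this (a sketch assuming a recent version that supports pad_to_multiple_of; the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pad the embedding matrix (and the tied lm_head) up to the next multiple
# of 128 so the vocabulary dimension divides evenly across shards.
model = AutoModelForCausalLM.from_pretrained("your-org/your-model")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")

model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)
print(model.get_input_embeddings().weight.shape)  # e.g. 32007 is padded up to 32128
```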
Yes, the nf4 quantization with no sharding works better, though I haven't tested extensively. Currently downloading the new model; will follow up once tested.
The vLLM inference server mentioned in the model card (https://github.com/vllm-project/vllm) gets the Python Fibonacci sequence code correct.
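A minimal vLLM test along these lines (a sketch; the model id is a placeholder, and vLLM pads the vocabulary internally, which is presumably why the odd size is not an issue there):

```python
from vllm import LLM, SamplingParams

# Load the model and run the same test prompt used above.
llm = LLM(model="your-org/your-model")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    ["Can you write a python script to calculate Fibonacci sequence?"],
    params,
)
print(outputs[0].outputs[0].text)
```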
It looks like the new OASST Llama 2 model is working well with the TGI Docker image 1.0.2 and sharding (it gets the Python Fibonacci script correct).
Thanks Andreas!
Great, thanks for testing! I will update the readme.