
Load model text-generation-webui issues

#4
by lazyDataScientist - opened

I'm running into an issue while using RunPod with an A100. After downloading the model, I get this error message for all versions of the model (both the Q*_0 and Q*_K quants).
You mentioned that you got it working on a single A100. Did you need to do any extra steps to get text-generation-webui working with Mixtral models?

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/workspace/text-generation-webui/modules/models.py", line 89, in load_model
    output = load_func_map[loader](model_name)
  File "/workspace/text-generation-webui/modules/models.py", line 259, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/workspace/text-generation-webui/modules/llamacpp_model.py", line 91, in from_pretrained
    result.model = Llama(**params)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 923, in __init__
    self._n_vocab = self.n_vocab()
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 2184, in n_vocab
    return self._model.n_vocab()
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 250, in n_vocab
    assert self.model is not None
AssertionError
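
This AssertionError is raised when llama-cpp-python fails to load the GGUF file at all, which often happens when the installed build is too old to know the Mixtral architecture. A minimal sketch to reproduce the load outside the webui and confirm where the failure is (the model path is a placeholder for whichever quant you downloaded):

    # Minimal sketch, assuming llama-cpp-python is installed.
    # MODEL_PATH is a placeholder; point it at the GGUF file you downloaded.
    from llama_cpp import Llama

    MODEL_PATH = "/workspace/text-generation-webui/models/model.Q4_0.gguf"

    try:
        llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_gpu_layers=-1)
        print("Loaded OK, vocab size:", llm.n_vocab())
    except Exception as exc:
        # Builds that predate Mixtral support fail at load time with the
        # same error shown in the traceback above.
        print("Load failed:", exc)

If this fails too, the problem is the llama-cpp-python build rather than text-generation-webui itself.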

You need to update Transformers on RunPod before launching it. I followed this tutorial: https://youtu.be/WjiX3lCnwUI?si=RnhYQR4eWWfeXCms&t=560
The 4x13B model works on a single A100, using about 96% of the GPU with FP16, so use that.
For GGUF, I think the latest Ooba update works with the latest llama.cpp release, but I don't use GGUF in Ooba. Sorry!

tl;dr: If you use an A100 on RunPod, use the unquantized files; it works!
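
For reference, a rough sketch of loading the unquantized FP16 weights with Transformers on a single A100 (the repo id below is a placeholder, not the actual model name):

    # Rough sketch, assuming an up-to-date Transformers plus the accelerate
    # package, and roughly 80 GB of VRAM for a 4x13B model in FP16.
    # "author/model-4x13b" is a placeholder repo id.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "author/model-4x13b"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.float16,  # FP16, as suggested above
        device_map="auto",          # keep the model on the single A100
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))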

Awesome! Thank you! Love the work you have been doing!
