Error loading model

#1
by smchapman54 - opened

Hello,

I've tried loading the Q8_0 quant in text-generation-webui on Windows, and I get this error:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'deepseek2'
llama_load_model_from_file: failed to load model
19:48:02-513121 ERROR Failed to load the model.

text-generation-webui's bundled llama-cpp-python needs an update; support for the deepseek2 architecture only landed in llama.cpp recently (see the PR linked below)

Turn off flash attention. This seems to be a known bug.
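If you're calling llama-cpp-python directly rather than going through the webui, recent builds expose the same toggle as a constructor flag. A minimal sketch, assuming a llama-cpp-python version new enough to have the flash_attn parameter; the model path is a placeholder:

from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-Coder-V2-Lite-Instruct-Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=False,  # keep flash attention off for this model (assumed parameter, recent versions only)
)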

I would think that's a different error than 'unknown model architecture', but I may be wrong.

Offloading some layers to the GPU (-ngl) with the latest llama.cpp returned "llama_init_from_gpt_params: error: failed to load model".
Using only the CPU solved this for me (as mentioned here: https://github.com/ggerganov/llama.cpp/pull/7519).
Using flash attention (-fa) gave the error: "GGML_ASSERT: ggml.c:5716: ggml_nelements(a) == ne0*ne1".

@wrtn2 you have to disable flash attention for this model to use GPU

@bartowski Thanks, good to know! In my case the card lacks sufficient VRAM, so I'd set llama.cpp to load only a subset of the layers on the GPU, which works with a number of models but seemingly not with this one.
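For anyone hitting the same wall, here's the CPU-only fallback expressed through llama-cpp-python rather than the CLI. A sketch only; the path is a placeholder:

from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-Coder-V2-Lite-Instruct-Q8_0.gguf",  # placeholder path
    n_gpu_layers=0,  # the equivalent of dropping -ngl: keep every layer on the CPU
)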

Hi all,
could you tell me how you got it running?

Right now I am using this cumbersome ipynb snippet:

from llama_cpp import Llama

llm = Llama(
      model_path="/DeepSeek-Coder-V2-Lite-Instruct-GGUF/DeepSeek-Coder-V2-Lite-Instruct-Q8_1.gguf",
      n_gpu_layers=-1, # -1 offloads all layers to the GPU
      # seed=1337, # uncomment to set a specific seed
      n_ctx=8*2048, # enlarged context window (16384 tokens)
)

response = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful coding assistant."},
          {
              "role": "user",
              "content": "give me quick sort in c++."
          }
      ]
)
print(response["choices"][0]["message"]["content"])

Is there a more convenient way, using Hugging Face or anything else?

Thank you in advance!

(FYI: just updated the name from Q8_1 to Q8_0_L.)

That looks like a fine implementation. Is there an issue you're running into, or are you just trying to find a better way?
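If you're after something closer to the transformers workflow, recent llama-cpp-python versions can pull the GGUF straight from the Hub with Llama.from_pretrained. A sketch; it needs the huggingface_hub package installed, and the repo_id/filename below are assumptions to adjust to the repo and quant you actually want:

from llama_cpp import Llama

# Pulls and caches the GGUF from the Hub instead of needing a local path.
llm = Llama.from_pretrained(
    repo_id="bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",  # assumed repo id
    filename="*Q8_0.gguf",  # glob matched against the repo's files; pick your quant
    n_gpu_layers=-1,
    n_ctx=8*2048,
)

The returned object behaves like a locally loaded Llama, so your create_chat_completion call works unchanged.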

Alright, great, thank you very much!

I am just used to the transformers lingo and thought maybe there's a better way.

And thanks for the fast reply!

Hi. I wanted to test a model up to 8 GB, so I downloaded the IQ3 quant. It doesn't work in GPT4All or LM Studio. I'd appreciate it if you could help me get it up and running.

Getting this in LM Studio with flash attention off; tried both with GPU offload and CPU only, same message. Not sure what to do :/ The preset is DeepSeek Coder; maybe it needs a DeepSeek Coder Instruct preset?

error:
"llama.cpp error: 'error loading model architecture: unknown model architecture: 'deepseek2''"

Update to 0.2.25 from the website, or ignore this if you're already on it.

I'm running LM Studio 0.2.26 and it fails. I tried GPT4All, Jan, and Ollama with ChatOllama; nothing will load this model. Tried Q4 and Q8, and flash attention is disabled. How do I use this model?

OK, I figured it out. If you are running Ollama in Docker, pull the latest image (newer images bundle a llama.cpp build that recognizes the deepseek2 architecture):

docker pull ollama/ollama:latest
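and then recreate the container so it actually runs the updated image. A sketch, assuming the standard setup from the Ollama README with container and volume both named ollama:

docker stop ollama
docker rm ollama
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:latest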
