Question about merging chinese-alpaca-33b with SuperHOT 8K LoRA

#2
by minlik - opened

Chinese-Alpaca-33B was trained with an extended tokenizer that adds Chinese tokens.
Kaio Ken's SuperHOT 8K was trained with the original LLaMA tokenizer.
Is it OK to merge them together?

Hmm, actually I don't know! I didn't look at that, I just did the merge.

Have you tried it, does it produce OK results?

it is producing some garbage when I tested it.....something like 你好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好好

This can happen with any SuperHOT model when you don't have the context size set correctly. How did you test it? If you're using ExLlama, make sure the context size is >2048 and compress_pos_emb is set appropriately, e.g. context size 4096 with compress_pos_emb 2.
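
For reference, a minimal sketch of those settings with the standalone exllama repo (the paths are placeholders, and the attribute names follow exllama's example scripts, so they may differ between versions):

```python
# Minimal sketch: loading a SuperHOT GPTQ model with the standalone exllama repo.
# Paths are placeholders; attribute names follow exllama's example scripts and
# may differ between versions.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/path/to/model/config.json")      # base LLaMA config
config.model_path = "/path/to/model/model.safetensors"    # GPTQ weights
config.max_seq_len = 4096          # extended context (> 2048)
config.compress_pos_emb = 2.0      # scaling factor = max_seq_len / 2048

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/path/to/model/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=32))
```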

If using AutoGPTQ, make sure you used trust_remote_code=True
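
A rough sketch of that AutoGPTQ path, with the model path as a placeholder (the from_quantized arguments follow typical AutoGPTQ usage; if I understand correctly, trust_remote_code=True is what pulls in the SuperHOT RoPE-scaling code shipped with the repo):

```python
# Sketch: loading the merged GPTQ model with AutoGPTQ. The model path below
# is a placeholder.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "path/to/merged-33b-superhot-gptq"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    trust_remote_code=True,   # loads the repo's custom code for 8K position scaling
    device="cuda:0",
)

inputs = tokenizer("你好，请介绍一下你自己。", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```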

The Alpaca prompt template should be used for inference.
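
For reference, the standard Alpaca prompt layout looks like this (the exact wording can vary slightly between model cards):

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```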

I tested the model with some simple scenarios, and it works fine with AutoGPTQ. I will test it further with longer contexts and more complex scenarios.

I think it's interesting that LoRA adapters trained on different base models with different tokenizers can be merged successfully.

@TheBloke thank you for your reply. I had indeed forgotten to set trust_remote_code=True in AutoGPTQ. After setting it, I am able to get correct output now.

For ExLlama, I also tested with a context size of 8192 and compress_pos_emb of 4 and was able to get correct output (and it is much faster than AutoGPTQ!).

P.S. Normally I prefer ExLlama over AutoGPTQ, since its performance is much better. But for a large model that cannot fit into VRAM, like this 33B-8K one, AutoGPTQ is my only choice (ExLlama gives CUDA out of memory if I load this model with 8K context).

However, today when I retested it using oobabooga's text-generation-webui, I found I can even load this model with 8K context using ExLlama, but only under Windows. The Windows Task Manager shows an additional 16GB of "Shared GPU memory" besides the 24GB of dedicated GPU memory, so once the dedicated GPU memory has been filled, it automatically starts using the shared GPU memory.

This only works under Windows. On Linux, I still get "CUDA out of memory" when I load this model with 8K context using ExLlama.

I did a quick search, and it seems https://www.reddit.com/r/LocalLLaMA/comments/14fb9c0/last_nvidia_drivers_let_you_use_the_shared_memory/ mentions that with the latest NVIDIA drivers someone got this working on Linux as well, but that was not the case for me.