---
license: other
---

# superhot-13b-16k-4bit--1g-safetensors

**Note: Maximum sequence length (`max_seq_len`) and compression factor (`compress_pos_emb`) must be set to 16384 (or lower) and 8, respectively.**

Merged base LLaMA and LoRA with alpaca-lora:
https://github.com/tloen/alpaca-lora

Base LLaMA 13B:
https://huggingface.co/huggyllama/llama-13b

SuperHOT 13B 16k no-rlhf-test LoRA:
https://huggingface.co/kaiokendev/superhot-13b-16k-no-rlhf-test

``` sh
BASE_MODEL=huggyllama_llama-13b LORA=kaiokendev_superhot-13b-16k-no-rlhf-test python export_hf_checkpoint.py
```

Quantized with AutoGPTQ:
https://github.com/PanQiWei/AutoGPTQ

``` sh
python quant_with_alpaca.py --pretrained_model_dir superhot-13b-16k-safetensors --quantized_model_dir superhot-13b-16k-4bit--1g-safetensors --bits 4 --group_size -1 --desc_act --num_samples 256 --save_and_reload
```

Perplexity:

```
CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-13b-16k-4bit--1g-safetensors \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 16384 \
         -cpe 8 \
         -ppl_cn 40 \
         -ppl_cs 16384 \
         -ppl_ct 16384
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 16384 -> 16384
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/tokenizer.model
 -- Model config: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/config.json
 -- Model: /workspace/models/superhot-13b-16k-4bit--1g-safetensors/4bit.safetensors
 -- Sequence length: 16384
 -- RoPE compression factor: 8.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 3.69 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 !! Model has empty group index (discarded)
 ** VRAM, Model: [cuda:0] 6,974.74 MB
 -- Loading dataset...
 -- Testing 21 chunks...
 ** Perplexity: 7.5462
```
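
The run above pairs a sequence length of 16384 with a RoPE compression factor of 8. A minimal sketch of why those two numbers go together follows. It assumes the compression factor simply divides each position index before the rotary angles are computed, and that the base model was trained with a 2048-token context, a head dimension of 128, and a RoPE base of 10000. The helper names `compressed_positions` and `rope_angles` are hypothetical illustrations, not functions from any loader.

```python
# Illustration only: how a compression factor of 8 maps 16384 positions
# back into the 0..2048 range the base LLaMA 13B model was trained on.
# Function names are hypothetical and not taken from any library.

NATIVE_CONTEXT = 2048  # assumed training context of the base model


def compressed_positions(seq_len, compress_pos_emb):
    """Scale each position index down by the compression factor."""
    return [i / compress_pos_emb for i in range(seq_len)]


def rope_angles(position, head_dim=128, base=10000.0):
    """Rotary angles for one (possibly fractional) position.

    One angle per pair of dimensions: position / base^(2k / head_dim).
    """
    return [position / (base ** (2 * k / head_dim)) for k in range(head_dim // 2)]


max_seq_len = 16384
compress_pos_emb = 8

positions = compressed_positions(max_seq_len, compress_pos_emb)

# The last token sits at a compressed position just below the native limit.
print(positions[-1])  # 2047.875
assert positions[-1] < NATIVE_CONTEXT

# The angles are then computed from the compressed position.
angles = rope_angles(positions[-1])
print(len(angles))  # 64
```

Under these assumptions, a shorter `max_seq_len` with the same factor of 8 also stays inside the trained range, which is consistent with the note allowing 16384 "or lower".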