---
license: mit
language:
- en
---

*Zephyr Logo*

# Zephyr 7B Alpha - Sharded

**UPDATE**

The original model ([Zephyr 7B Alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)) was recently sharded, so you can now use the original model directly.

---

🧩🧩🧩 Just a sharded version of [Zephyr 7B Alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha).

💻 Using this version, you can smoothly load the model on Colab and play with it!

From the [original model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha):

> Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-α is the first model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). We found that removing the in-built alignment of these datasets boosted performance on [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes.

## Usage

This version of the model is meant primarily to run smoothly on **Colab**.

I suggest loading the model with **8-bit quantization**, so that you have some free GPU memory left to perform inference.

*However, it is perfectly fine to load the model in half-precision or with stronger quantization (4-bit); a 4-bit sketch follows the main example below.*

```python
! pip install transformers accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the sharded model with 8-bit quantization so it fits comfortably in Colab GPU memory
model = AutoModelForCausalLM.from_pretrained("anakin87/zephyr-7b-alpha-sharded", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("anakin87/zephyr-7b-alpha-sharded")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a rapper",
    },
    {"role": "user", "content": "What is GPU?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

# Example output:
# <|system|>
# You are a friendly chatbot who always responds in the style of a rapper
# <|user|>
# What is GPU?
# <|assistant|>
# Yo, what's up fam, you askin' 'bout the GPU?
# Well, let me break it down for you, it's a pretty sick dud
# It stands for Graphics Processing Unit, a tech that's quite rude
# This bad boy's the one that's in charge of all the graphics you see
# On your computer screen or your high-tech TV
# It's a powerful tool that can handle intense 3D games and movies
# And it's built to handle multiple tasks with ease
# So if you're looking to take your gaming or video editing to the next level
# Just make sure you've got a top-notch GPU to make it happen.
# Peace out!
```
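If you want to push memory usage even lower, here is a minimal sketch of 4-bit loading via `BitsAndBytesConfig`. The specific quantization settings (NF4, float16 compute dtype) are illustrative choices, not part of the original card; half-precision loading works the same way, passing `torch_dtype=torch.float16` instead of a quantization config.

```python
# Minimal 4-bit loading sketch (settings are illustrative, not from the original card)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumed choice: NF4 quantization
    bnb_4bit_compute_dtype=torch.float16, # assumed choice: half-precision compute
)

model = AutoModelForCausalLM.from_pretrained(
    "anakin87/zephyr-7b-alpha-sharded",
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("anakin87/zephyr-7b-alpha-sharded")
```

The resulting `model` and `tokenizer` can then be used with the same `pipeline` and chat-template code shown above.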