---
base_model: unsloth/meta-llama-3.1-8b-bnb-4bit
language:
- en
- yo
- zu
- xh
- wo
- fr
- ig
- ha
- am
- ar
- so
- sw
- sn
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
datasets:
- vutuka/aya_african_alpaca
pipeline_tag: text-generation
---

# Llama-3.1-8B-african-aya

- **Developed by:** vutuka
- **License:** apache-2.0
- **Finetuned from model:** unsloth/meta-llama-3.1-8b-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

## Unsloth Inference (2x Faster)

```sh
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
```

```py
max_seq_length = 4096
dtype = None          # None = auto-detect the best dtype for the GPU.
load_in_4bit = True   # Use 4-bit quantization to reduce memory usage.

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
```

```py
## Load the quantized model
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "vutuka/Llama-3.1-8B-african-aya",
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
)
FastLanguageModel.for_inference(model)
```

```py
from transformers import TextStreamer

def llama_african_aya(input: str = "", instruction: str = ""):
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                instruction,
                input,
                "",  # response is left blank for the model to complete
            )
        ],
        return_tensors = "pt").to("cuda")

    text_streamer = TextStreamer(tokenizer)
    # _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 800)

    # Generate the response
    output = model.generate(**inputs, max_new_tokens=1024)

    # Decode the generated response
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the response part (the response starts after "### Response:")
    response_start = generated_text.find("### Response:") + len("### Response:")
    response = generated_text[response_start:].strip()

    # Format the response in Markdown if desired
    # markdown_response = f"{response}"
    # display(Markdown(markdown_response))

    return response
```

```py
llama_african_aya(
    instruction="",
    input="Àwọn ajínigbé méjì ni wọ́n mú ní Supare Akoko, ṣàlàyé ìtàn náà.",
)
```
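The commented-out `TextStreamer` call in the helper above prints tokens to stdout as they are generated instead of returning the full string at the end. Below is a minimal sketch of that streaming path, reusing the `model`, `tokenizer`, and `alpaca_prompt` defined above; the `skip_prompt=True` flag is optional and only hides the echoed prompt.

```py
from transformers import TextStreamer

inputs = tokenizer(
    [alpaca_prompt.format(
        "",  # instruction
        "Àwọn ajínigbé méjì ni wọ́n mú ní Supare Akoko, ṣàlàyé ìtàn náà.",  # input
        "",  # response is left blank for the model to complete
    )],
    return_tensors = "pt").to("cuda")

# Print tokens as they are generated instead of waiting for the full completion.
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 800)
```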
## LlamaCPP Code

```sh
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
pip install llama-cpp-python
```

```py
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Download the GGUF model
model_name = "vutuka/Llama-3.1-8B-african-aya"
model_file = "llama-3.1-8B-african-aya.Q8_0.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

## Instantiate the model from the downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=-1,
    n_batch=512,
    verbose=False,
)

## Run inference
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

prompt = alpaca_prompt.format(
    "",
    "Àwọn ajínigbé méjì ni wọ́n mú ní Supare Akoko, ṣàlàyé ìtàn náà.",
    "",
)

res = llm(prompt)  # res is a result dictionary

## Unpack the generated text from the response dictionary and print it
print(res["choices"][0]["text"])
```
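`llama-cpp-python` also accepts the usual sampling parameters and can stream the completion as it is generated. A minimal sketch reusing the `llm` and `alpaca_prompt` objects above; the `max_tokens`, `temperature`, and stop-string values here are illustrative, not tuned settings.

```py
prompt = alpaca_prompt.format(
    "",  # instruction
    "Àwọn ajínigbé méjì ni wọ́n mú ní Supare Akoko, ṣàlàyé ìtàn náà.",  # input
    "",  # response
)

# stream=True yields completion chunks one at a time instead of a single dict.
for chunk in llm(prompt, max_tokens=512, temperature=0.7,
                 stop=["### Instruction:"], stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```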