---
license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- Text Generation
- llama3.1
- text-generation-inference
- Inference Endpoints
- Transformers
- Fusion
language:
- en
---

# Llama-3.1-8B-Fusion-6040

## Overview

`Llama-3.1-8B-Fusion-6040` is a mixed model that combines the strengths of two powerful Llama-based models: [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) and [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated).
The weights are blended in a 6:4 ratio, with 60% of the weights from SuperNova-Lite and 40% from the abliterated Meta-Llama-3.1-8B-Instruct model.
**Although it's a simple mix, the model is usable, and no gibberish has appeared**.

This is an experiment. Later, I will test the [9:1](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010), [8:2](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-8020), [7:3](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-7030), 6:4, and 5:5 ratios separately to see how much impact each has on the model.

## Model Details

- **Base Models:**
  - [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) (60%)
  - [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated) (40%)
- **Model Size:** 8B parameters
- **Architecture:** Llama 3.1
- **Mixing Ratio:** 6:4 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)

## Key Features

- **SuperNova-Lite Contributions (60%):** Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- **Meta-Llama-3.1-8B-Instruct-abliterated Contributions (40%):** This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
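
The card does not give the exact mixing procedure, so the following is only a minimal sketch of what a 6:4 blend could look like, assuming it is a plain per-parameter weighted average of two models with identical parameter names. The function `blend_state_dicts` and the toy dictionaries are hypothetical illustrations, not the author's script; scalars stand in for weight tensors so the example runs without any dependencies.

```python
from typing import Any, Dict, Mapping


def blend_state_dicts(
    sd_a: Mapping[str, Any],
    sd_b: Mapping[str, Any],
    weight_a: float = 0.6,
) -> Dict[str, Any]:
    """Return weight_a * sd_a + (1 - weight_a) * sd_b, parameter by parameter.

    Assumption: values support multiplication by a float and addition
    (plain floats here; tensors would behave the same way).
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if sd_a.keys() != sd_b.keys():
        raise ValueError("both models must expose the same parameter names")
    weight_b = 1.0 - weight_a
    return {name: weight_a * sd_a[name] + weight_b * sd_b[name] for name in sd_a}


# Toy example: scalars standing in for the two parents' weight tensors.
supernova = {"layer.weight": 1.0, "layer.bias": 2.0}
abliterated = {"layer.weight": 0.0, "layer.bias": 4.0}

blended = blend_state_dicts(supernova, abliterated, weight_a=0.6)
print(blended)
```

With real checkpoints, the two inputs would presumably be the `state_dict()` of each parent model, and the blended dictionary could be loaded into a model of the same architecture with `load_state_dict`. Changing `weight_a` to 0.9, 0.8, 0.7, or 0.5 would correspond to the other ratios mentioned in the Overview.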

## Usage

You can use this mixed model in your applications by loading it with Hugging Face's `transformers` library:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-6040"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: generate text using the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output as tokens are decoded
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens (output length minus prompt length)
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")
```

## Evaluations

We will submit this model to the Open LLM Leaderboard for a more conclusive benchmark. In the meantime, here are our internal benchmarks, run with the main branch of the lm-evaluation-harness: