Disclaimer: This model merge has not been thoroughly tested and is experimental. Expect further versions , with improvements, in the coming days.

ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512

ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512 is a powerful, versatile merged model combining the long-context capabilities of Princeton's ProLong model and the rich, immersive roleplay features from Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the mergekit library using advanced configuration to balance efficiency, roleplay fidelity, and long-context capabilities, aiming to provide an unparalleled user experience for extended interactions.

Model Components and Sources

This model is a merge of the following:

princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
Developed by Princeton NLP, ProLong brings long-context capabilities up to 512,000 tokens, optimized for detailed and extended conversations. Continued training on extensive datasets equips it for high-quality retrieval, while offering coherent responses even in lengthy contexts.
Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
This model introduces roleplay and immersive storytelling, building on creative datasets to create compelling interactions. Role-specific configurations support vibrant and in-depth character simulations.

🧩 Configuration and Merge Details

The model merge was executed using a carefully crafted YAML configuration on MergeKit. Key aspects of the configuration ensure that each component's strengths are preserved while optimizing for performance in complex, long-context scenarios.

YAML Configuration

models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5  # Emphasizes roleplay elements without overshadowing the base
      density: 0.6  # Retains 60% of the significant parameters from the roleplay model

merge_method: della  # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05  # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0  # Harmonizes parameter influence from both models
  normalize: true  # Ensures stable alignment of merged parameters
  int8_mask: true  # Enhances memory efficiency for extended contexts

dtype: float32
out_dtype: bfloat16  # Balances precision and efficiency for versatile deployments

Intended Usage

The ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K model is designed for:

Extended Conversations: With a 512K token context window, it is ideal for scenarios requiring sustained, cohesive dialogue.
Roleplay and Storytelling: The integration of SAO-themed and roleplay-focused datasets creates a rich and immersive storytelling experience, perfect for applications in interactive fiction, virtual characters, and creative writing.
General Instruction Following: Fine-tuned on UltraChat, the model maintains a helpful and instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation.

📚 Dataset Details for ProLong 8B Training

The ProLong-8B model was rigorously trained with a carefully curated dataset, ensuring versatility across long-context scenarios.

Continued Long-context Training

Data Composition:
- 30% Code Repositories: This includes diverse sources to enhance technical comprehension and code-related dialogue.
- 30% Books: A mix of general and specialized literature to improve narrative and comprehension abilities.
- 3% Textbooks: Technical textbooks for specialized and academic context handling.
- 37% ShortMix: A balanced blend of various online sources for comprehensive topic coverage.
  - ShortMix Components:
    - 27% FineWeb-Edu
    - 27% FineWeb
    - 11% Tulu-v2
    - 11% StackExchange
    - 8% Wikipedia
    - 8% OpenWebMath
    - 8% ArXiv
Training Stages:
- Stage 1 (64K Context Window):
  - Utilized code repositories, books, and textbooks.
  - Training Steps: 20B tokens over approximately 2.2K H100 GPU hours.
- Stage 2 (512K Context Window):
  - Code repositories (50% at 512K length and 50% at 64K length).
  - Books (17% at 512K and 83% at 64K).
  - Textbooks primarily focused on a 512K length.
  - Training Steps: 20B tokens over approximately 12.2K H100 GPU hours.
Optimization and Model Configuration:
- Optimizer: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95.
- Learning Rate:
  - Stage 1: Initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6.
  - Batch Size: 4M tokens for Stage 1 and 8M tokens for Stage 2.
- Attention Mechanism: Full attention with cross-document attention masking to effectively handle extensive context windows.

Supervised Fine-tuning (SFT)

Data Source:
- UltraChat: A robust dataset with 1B tokens specifically selected to enhance conversational depth and responsiveness.
Optimization:
- Optimizer: AdamW with parameters as above.
- Learning Rate: 2e-5 with a 5% warmup and cosine decay to 2e-6.
- Batch Size: 4M tokens for efficient training on high-context tasks.

Key Features

Long Context Capability: Leveraging Princeton’s ProLong model, this model can handle up to 512K tokens, enabling consistent and detailed responses even in lengthy interactions.
Immersive Roleplay Dynamics: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions.
Enhanced Memory Efficiency: Configured to utilize int8_mask, which aids in managing larger context sizes efficiently on limited hardware resources.

Acknowledgments

Princeton NLP: For creating the ProLong models, which bring unprecedented long-context handling capabilities to the Llama series.
Casual-Autopsy: For providing F32 quants of L3-bluuwhale-SAO-MIX, a rich roleplay model that adds thematic depth and interaction diversity.
Bluuwhale: For merging L3-SAO-MIX-8B-V1.
Sao10K: For creating these wonderful models, adding rich roleplay models that adds thematic depth and character continuity. SAO10K.

Citation

If you use this model, please consider citing the work of the ProLong developers:

@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  journal={arXiv preprint arXiv:2410.02660},
  year={2024}
}

ZeroXClem
/

Llama-3-8B-ProLong-SAO-Roleplay-512k