File size: 9,549 Bytes
6015b31 44c3921 6015b31 6abbf5e 6015b31 44c3921 6015b31 af44f45 3d41055 af44f45 0ba64e0 6015b31 ac59e23 936bb7f 6015b31 936bb7f 6015b31 f8f6a65 6015b31 936bb7f 6015b31 88cd2a9 5c73b29 be0c291 d2764e2 6015b31 f8f6a65 6015b31 f8f6a65 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 |
---
library_name: transformers
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
---
# Hymba-1.5B-Base
<p align="center">
๐พ <a href="https://github.com/NVlabs/hymba">Github</a>   |    ๐ <a href="https://arxiv.org/abs/2411.13676">Paper</a> |    ๐ <a href="https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/">Blog</a>  
</p>
## Model Overview
Hymba-1.5B-Base is a base text-to-text model that can be adopted for a variety of natural language generation tasks.
The model has hybrid architecture with Mamba and Attention heads running in parallel. Meta tokens, a set of learnable tokens prepended to every prompt, help improve the efficacy of the model. The model shares KV cache between 2 layers and between heads in a single layer. 90% of attention layers are sliding window attention.
This model is ready for commercial use.
**[Caution] During generation, the batch size needs to be 1. Our current implementation does not fully support padding of Meta tokens + SWA; this is a work in progress. Training and pre-filling support any batch size.**
**Model Developer:** NVIDIA
**Model Dates:** Hymba-1.5B-Base was trained between September 1, 2024 and November 10th, 2024.
**License:**
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
## Model Architecture
Hymba-1.5B-Base has a model embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504, with 32 layers in total, 16 SSM states, 3 full attention layers, the rest are sliding window attention. Unlike the standard Transformer, each attention layer in Hymba has a hybrid combination of standard attention heads and Mamba heads in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
Features of this architecture:
- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.
<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/module.png" alt="Hymba Module" width="600">
</div>
- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.
- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.
<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
</div>
## Performance Highlights
- Hymba-1.5B-Base outperforms all sub-2B public models.
<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="800">
</div>
<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance2.png" alt="Compare with SoTA Small LMs" width="800">
</div>
## Model Usage
### Step 1: Environment Setup
Since Hymba-1.5B-Base employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on Pytorch2.5 and other related dependencies, we provide two ways to setup the environment:
- **[Local install]** Install the related packages using our provided `setup.sh` (support CUDA 12.1/12.4):
```
wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
bash setup.sh
```
- **[Docker]** A docker image is provided with all of Hymba's dependencies installed. You can download our docker image and start a container using the following commands:
```
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
```
### Step 2: Chat with Hymba-1.5B-Base
After setting up the environment, you can use the following script to chat with our Model
```py
from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model
repo_name = "nvidia/Hymba-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
# Chat with Hymba
prompt = input()
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
outputs = model.generate(**inputs, max_length=64, do_sample=False, temperature=0.7, use_cache=True)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(f"Model response: {response}")
```
## Finetuning Hymba
[LMFlow](https://github.com/OptimalScale/LMFlow) is a complete pipeline for fine-tuning large language models.
The following steps provide an example of how to fine-tune the `Hymba-1.5B-Base` models using LMFlow.
1. Using Docker
```
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
```
2. Install LMFlow
```
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
```
3. Fine-tune the model using the following command.
```
cd LMFlow
bash ./scripts/run_finetune_hymba.sh
```
With LMFlow, you can also fine-tune the model on your custom dataset. The only thing you need to do is transform your dataset into the [LMFlow data format](https://optimalscale.github.io/LMFlow/examples/DATASETS.html).
In addition to full-finetuniing, you can also fine-tune hymba efficiently with [DoRA](https://arxiv.org/html/2402.09353v4), [LoRA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lora), [LISA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lisa), [Flash Attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md), and other acceleration techniques.
For more details, please refer to the [LMFlow for Hymba](https://github.com/OptimalScale/LMFlow/tree/main/experimental/Hymba) documentation.
## Evaluation
We use [`LM Evaluation Harness`](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the model. The evaluation commands are as follows:
```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
git fetch --all --tags
git checkout tags/v0.4.4 # squad completion task is not compatible with the latest version
cd lm-evaluation-harness
pip install -e .
lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 1 \
--output_path ./hymba_HF_base_lm-results \
--log_samples
lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
--tasks arc_easy,arc_challenge,piqa,winogrande,hellaswag \
--num_fewshot 0 \
--batch_size 1 \
--output_path ./hymba_HF_base_lm-results \
--log_samples
lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
--tasks squad_completion \
--num_fewshot 1 \
--batch_size 1 \
--output_path ./hymba_HF_base_lm-results \
--log_samples
```
## Limitations
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
The testing suggests that this model is susceptible to jailbreak attacks. If using this model in a RAG or agentic setting, we recommend strong output validation controls to ensure security and safety risks from user-controlled model outputs are consistent with the intended use cases.
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
```
@misc{dong2024hymbahybridheadarchitecturesmall,
title={Hymba: A Hybrid-head Architecture for Small Language Models},
author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Lin and Jan Kautz and Pavlo Molchanov},
year={2024},
eprint={2411.13676},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.13676},
} |