---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2-7B-Instruct
model-index:
- name: Dolphin
results: []
tags:
- RAG
- on-device language model
- Retrieval Augmented Generation
inference: false
space: false
spaces: false
language:
- en
---
# Dolphin: Long Context as a New Modality for on-device RAG
<p align="center">
- <a href="https://www.nexaai.com/models" target="_blank">Nexa Model Hub</a>
- <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
</p>
<p align="center" width="100%">
<a><img src="logo.png" alt="nexa-octopus" style="width: 30%; min-width: 300px; display: block; margin: auto;"></a>
</p>
## Overview
Dolphin is a novel approach that accelerates language model inference by treating long context as a distinct modality, analogous to images, audio, and video in vision-language models. It uses a language encoder to compress context information into embeddings, applying multimodal-model techniques to make language model inference more efficient. Below are the model highlights:
- 🧠 Context as a distinct modality
- 🗜️ Language encoder for context compression
- 🔗 Multimodal techniques applied to language processing
- ⚡ Optimized for energy efficiency and on-device use
- 📜 Specialized for long context understanding
## Model Architecture
Dolphin employs a decoder-decoder framework built around two decoders:
1. A smaller decoder (0.5B parameters) that compresses information from extensive contexts into embeddings
2. A larger decoder (7B parameters) that comprehends the current query and generates the response

A projector aligns the embeddings produced by the context encoder with the embedding space of the main decoder.
![Model Architecture](modelstructure.jpg)
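For a concrete sense of how the pieces connect, here is a minimal, self-contained sketch of the decoder-decoder idea in PyTorch. Everything in it is illustrative: the hidden sizes, the number of compressed "memory" slots, and the MLP projector design are assumptions for demonstration, not the actual Dolphin implementation.

```python
import torch
import torch.nn as nn

class ContextProjector(nn.Module):
    """Maps context-encoder embeddings into the main decoder's embedding space.
    The two-layer MLP here is an assumed design, chosen for illustration."""
    def __init__(self, encoder_dim: int, decoder_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(context_embeds)

# Toy dimensions standing in for the 0.5B context encoder and 7B decoder.
encoder_dim, decoder_dim = 896, 3584
projector = ContextProjector(encoder_dim, decoder_dim)

# Pretend the small decoder has compressed a long context into 32 memory
# embeddings (the real compression ratio is a property of the trained model).
context_memory = torch.randn(1, 32, encoder_dim)
query_embeds = torch.randn(1, 16, decoder_dim)  # embedded current query

# Prepend the projected context tokens to the query embeddings, much as
# vision-language models prepend projected image tokens to text tokens.
decoder_inputs = torch.cat([projector(context_memory), query_embeds], dim=1)
print(decoder_inputs.shape)  # torch.Size([1, 48, 3584])
```

The payoff of this arrangement is that the large decoder attends over a few dozen memory tokens instead of thousands of raw context tokens, which is where the latency and energy savings come from.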
## Running the Model
```python
from transformers import AutoTokenizer
from configuration_dolphin import DolphinForCausalLM
import time

tokenizer = AutoTokenizer.from_pretrained('nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)
model = DolphinForCausalLM.from_pretrained('nexa-collaboration/dolphin_instruct_1M_0805', trust_remote_code=True)

def inference(input_text):
    # Tokenize the prompt and generate up to 100 new tokens.
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

input_text = "Take a selfie for me with front camera"
nexa_query = f"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input_text} \n\nResponse:"

start_time = time.time()
result = inference(nexa_query)
print("Dolphin model result:\n", result)
print("Latency:", time.time() - start_time, "s")
```
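The example above uses a short function-calling prompt. For retrieval-augmented use, retrieved passages would be supplied as context in the prompt; since this card does not document the exact context template the checkpoint expects, the format below is an assumption for illustration and reuses the `inference` helper defined above.

```python
# Illustrative only: this context/query template is an assumption,
# not the documented prompt format for this checkpoint.
retrieved_passage = (
    "To access the front camera, request the CAMERA permission and "
    "select the front-facing lens when opening the capture session."
)
rag_query = (
    f"Context:\n{retrieved_passage}\n\n"
    f"Query: How do I open the front camera?\n\nResponse:"
)
print("Dolphin RAG result:\n", inference(rag_query))
```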
## Training Process
Dolphin's training involves three stages:
1. Restoration Training: Reconstructing original context from compressed embeddings
2. Continual Training: Generating context continuations from partial compressed contexts
3. Instruction Fine-tuning: Generating responses to queries given compressed contexts
This multi-stage approach progressively enhances the model's ability to handle long contexts and generate appropriate responses.
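As a rough sketch of how the three objectives differ, the snippet below shows hypothetical (input, target) pairs for each stage. `<ctx>` stands in for the compressed context embeddings produced by the 0.5B encoder; the placeholder tokens and toy texts are assumptions, not the model's real vocabulary or training data.

```python
# Hypothetical (input, target) pairs for the three training stages.
long_context = "Dolphins are highly social marine mammals that live in pods..."
prefix, continuation = long_context[:40], long_context[40:]

# Stage 1 - Restoration: reconstruct the original context from <ctx>.
stage1 = {"input": "<ctx>", "target": long_context}

# Stage 2 - Continual: continue the text from a partially compressed prefix.
stage2 = {"input": f"<ctx:{prefix}>", "target": continuation}

# Stage 3 - Instruction fine-tuning: answer a query given <ctx>.
stage3 = {
    "input": "<ctx> Query: Where do dolphins live?\n\nResponse:",
    "target": "According to the context, dolphins live in pods.",
}
```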
## Citation
If you use Dolphin in your research, please cite our paper:
```bibtex
@article{dolphin2024,
  title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models},
  author={[Author Names]},
  journal={arXiv preprint arXiv:[paper_id]},
  year={2024}
}
```
## Contact
For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).