---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2-7B-Instruct
model-index:
- name: Dolphin 
  results: []
tags:
- RAG
- on-device language model
- Retrieval Augmented Generation
inference: false
space: false
spaces: false
language:
- en
---
# Dolphin: Long Context as a New Modality for on-device RAG

<p align="center">
- <a href="https://www.nexaai.com/models" target="_blank">Nexa Model Hub</a>
- <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
</p>

<p align="center" width="100%">
  <a><img src="logo.png" alt="nexa-octopus" style="width: 30%; min-width: 300px; display: block; margin: auto;"></a>
</p>

## Overview
Dolphin is a novel approach that accelerates language model inference by treating long context as a new modality, analogous to the image, audio, and video modalities in vision-language models. It incorporates a language encoder model that compresses context information into embeddings, applying multimodal-model concepts to make long-context inference more efficient. Below are the model highlights:
- 🧠 Context as a distinct modality
- 🗜️ Language encoder for context compression
- 🔗 Multimodal techniques applied to language processing
- ⚡ Optimized for energy efficiency and on-device use
- 📜 Specialized for long context understanding
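
The idea behind these highlights is a specific input layout: the long context goes to a small encoder and is replaced by a fixed number of memory embeddings on the main decoder's side. The snippet below is a minimal sketch of that layout; it reuses the `MEMORY_SIZE`, `<context>` marker, and `[memory_i]` tokens from the usage example further down, and does not run any model.

```python
from transformers import AutoTokenizer

MEMORY_SIZE = 32  # number of memory slots, matching the usage example below

tokenizer = AutoTokenizer.from_pretrained("NexaAIDev/Dolphin", trust_remote_code=True)

question = "Who founded Nexa AI?"
long_context = "Nexa AI is a Cupertino-based company founded in May 2023 ..."

# Context-encoder input: the raw context followed by MEMORY_SIZE [memory_i] tokens,
# whose hidden states become the compressed context representation.
encoder_text = long_context + "".join(f"[memory_{i}]" for i in range(MEMORY_SIZE))

# Main-decoder input: the question with MEMORY_SIZE placeholder ids (-1) marking where
# the projected memory embeddings are inserted, so the large decoder never attends
# over the raw long context.
prefix_ids, question_ids = (
    tokenizer(chunk).input_ids for chunk in f" <context>{question}".split("<context>")
)
decoder_ids = prefix_ids + [-1] * MEMORY_SIZE + question_ids

print(len(tokenizer(encoder_text).input_ids), "encoder tokens ->", MEMORY_SIZE, "memory slots")
```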

## Model Architecture
Dolphin employs a decoder-decoder framework built around two decoders:
1. A smaller decoder (0.5B parameters) that compresses information from extensive contexts into a fixed set of memory embeddings
2. A larger decoder (7B parameters) that comprehends the current query and generates the response

A projector aligns the memory embeddings produced by the context encoder with the main decoder's embedding space; the wiring is sketched after the figure below.

![Model Architecture](modelstructure.jpg)
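
To make the decoder-decoder wiring concrete, here is a minimal PyTorch sketch under stated assumptions: the class and attribute names, the two-layer projector, and the hidden sizes (896 for the 0.5B encoder, 3584 for the 7B decoder) are illustrative choices, not the released implementation (which ships in the repository's `configuration_dolphin.py` and is used by the example in the next section).

```python
import torch
import torch.nn as nn

MEMORY_SIZE = 32  # number of memory embeddings the context is compressed into


class DolphinSketch(nn.Module):
    """Illustrative decoder-decoder wiring; names and sizes are assumptions."""

    def __init__(self, context_encoder, main_decoder, encoder_dim=896, decoder_dim=3584):
        super().__init__()
        self.context_encoder = context_encoder  # small 0.5B decoder used as a context encoder
        self.main_decoder = main_decoder        # large 7B decoder that answers the query
        # Projector that maps the encoder's hidden size to the main decoder's embedding size
        self.projector = nn.Sequential(
            nn.Linear(encoder_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, input_ids, context_input_ids, context_attention_mask):
        # 1. Encode the long context with the small decoder and keep the hidden states
        #    at the trailing MEMORY_SIZE [memory_i] positions.
        ctx_out = self.context_encoder(
            input_ids=context_input_ids,
            attention_mask=context_attention_mask,
            output_hidden_states=True,
        )
        memory_states = ctx_out.hidden_states[-1][:, -MEMORY_SIZE:, :]

        # 2. Project the memory states into the main decoder's embedding space.
        memory_embeds = self.projector(memory_states)

        # 3. Embed the query tokens normally and overwrite the -1 placeholder slots
        #    with the projected memory embeddings (clamp maps -1 to a valid id first).
        tok_embeds = self.main_decoder.get_input_embeddings()(input_ids.clamp(min=0)).clone()
        memory_slots = input_ids == -1
        tok_embeds[memory_slots] = memory_embeds.reshape(-1, memory_embeds.size(-1)).to(tok_embeds.dtype)

        # 4. The main decoder attends over [prefix | MEMORY_SIZE memory embeddings | query]
        #    instead of the full long context.
        return self.main_decoder(inputs_embeds=tok_embeds)
```

In this view, an arbitrarily long context costs the 7B decoder only `MEMORY_SIZE` positions, which is where the efficiency of the approach comes from.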

## Running the Model

```python
import time

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from configuration_dolphin import DolphinConfig, DolphinForCausalLM

# Register the custom architecture so the Auto classes can load it
AutoConfig.register("dolphin", DolphinConfig)
AutoModelForCausalLM.register(DolphinConfig, DolphinForCausalLM)

MEMORY_SIZE = 32       # number of memory embeddings the context is compressed into
EOS_TOKEN_ID = 151643  # Qwen2 <|endoftext|> token id


def inference_instruct(mycontext, question, device="cuda:0"):
    start = time.time()
    generated_token_ids = []
    prompt = f" <context>{question}"
    print("input prompt: " + prompt)
    print("input context: " + mycontext)
    # Split the prompt at the <context> marker and leave MEMORY_SIZE placeholder
    # slots (-1) where the projected context embeddings will be inserted
    text_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<context>")]
    input_ids = (
        torch.tensor(text_chunks[0] + [-1] * MEMORY_SIZE + text_chunks[1], dtype=torch.long)
        .unsqueeze(0)
        .to(device)
    )
    # Tokenize the context followed by MEMORY_SIZE [memory_i] tokens for the context encoder
    context_tokenized = tokenizer(
        mycontext + "".join([f"[memory_{i}]" for i in range(MEMORY_SIZE)]),
        return_tensors="pt",
    )
    context_tokenized = {k: v.to(device) for k, v in context_tokenized.items()}
    context_token_count = context_tokenized["input_ids"].shape[1] - MEMORY_SIZE
    print("length of context: " + str(context_token_count) + " tokens")
    # Greedy decoding loop; generation is capped at the context length
    for i in range(context_token_count):
        print(f"\rGenerating token {i+1}/{context_token_count}", end="")
        next_token = (
            model(
                input_ids,
                context_input_ids=context_tokenized["input_ids"],
                context_attention_mask=context_tokenized["attention_mask"],
            )
            .logits[:, -1]
            .argmax(-1)
        )
        if next_token.item() == EOS_TOKEN_ID:
            break
        generated_token_ids.append(next_token.item())
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=-1)
    print("\noutput: " + tokenizer.decode(generated_token_ids))
    end = time.time()
    print(f"Elapsed time: {end - start:.2f}s")


# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('NexaAIDev/Dolphin', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('NexaAIDev/Dolphin', trust_remote_code=True).to("cuda:0")
model.eval()

# Run inference examples
mycontext = "Nexa AI is a Cupertino-based company founded in May 2023 that researches and develops models and tools for on-device AI applications. The company is founded by Alex and Zack. The company is known for its Octopus-series models, which rival large-scale language models in capabilities such as function-calling, multimodality, and action-planning, while remaining efficient and compact for edge device deployment. Nexa AI's mission is to advance on-device AI in collaboration with the global developer community. To this end, the company has created an on-device model hub for users to find, share, and collaborate on open-source AI models optimized for edge devices, as well as an SDK for developers to run and deploy AI models locally."
inference_instruct(mycontext, "who founded Nexa AI?")
inference_instruct(mycontext, "what is the mission of Nexa AI?")
inference_instruct(mycontext, "what is the performance of Octopus V2 and V3?")
inference_instruct(mycontext, "when is Nexa AI founded?")
```

## Training Process
Dolphin's training involves three stages:
1. Restoration Training: Reconstructing original context from compressed embeddings
2. Continual Training: Generating context continuations from partial compressed contexts
3. Instruction Fine-tuning: Generating responses to queries given compressed contexts

This multi-stage approach progressively enhances the model's ability to handle long contexts and generate appropriate responses.
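
The training pipeline itself is not part of this card, but the three stages can be pictured as three ways of pairing an encoder-side input with a decoder-side prompt and target. The sketch below is a rough illustration under assumptions (the function, stage, and field names are ours, and the character-level split in stage 2 stands in for a token-level one), not the released recipe.

```python
def build_stage_sample(stage, context, question=None, response=None):
    """Return one training sample: what the context encoder sees, what the main
    decoder is prompted with, and what it is trained to generate."""
    if stage == "restoration":
        # Stage 1: compress the context, then learn to reconstruct it verbatim.
        return {"encoder_text": context, "decoder_prompt": "", "decoder_target": context}
    if stage == "continual":
        # Stage 2: compress the first half of the context, predict its continuation.
        split = len(context) // 2
        return {"encoder_text": context[:split], "decoder_prompt": "", "decoder_target": context[split:]}
    if stage == "instruction":
        # Stage 3: compress the full context and answer a query about it.
        return {"encoder_text": context, "decoder_prompt": question, "decoder_target": response}
    raise ValueError(f"unknown stage: {stage}")


# Toy usage: one document feeding all three stages.
doc = "Nexa AI is a Cupertino-based company founded in May 2023."
for stage in ("restoration", "continual", "instruction"):
    sample = build_stage_sample(stage, doc, question="Where is Nexa AI based?", response="Cupertino")
    print(stage, "->", sample["decoder_target"][:40])
```

Each stage uses the same compressed-memory interface; only the pairing of inputs and targets changes, which is what progressively teaches the encoder and projector to preserve the information the main decoder needs.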

## Citation
If you use Dolphin in your research, please cite our paper:

```bibtex
@article{dolphin2024,
  title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models},
  author={[Author Names]},
  journal={arXiv preprint arXiv:[paper_id]},
  year={2024}
}
```

## Contact
For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).